AI as an accelerator, not a replacement: workflow, code quality, and ownership

Michel Mix · Medellín, Colombia · May 7, 2026

Summary

This literature synthesis examines which workflow principles are associated with productivity, code quality, and human ownership in AI-assisted software development. The practical conclusion is clear: AI coding assistants can speed up developers, but mainly when they are embedded in an explicit workflow with sharp task scoping, strong specification, controlled generation, structured verification, quality assurance, and human review. At the same time, the literature shows why unfocused AI use can be misleading. Developers often feel faster, while verification, adaptation, technical debt, and loss of ownership can reduce or even reverse the gain. The strongest workflow is therefore not “more AI,” but disciplined AI use inside normal engineering controls.

Key Points

Positioning

This text is a literature synthesis for a presentation on AI-assisted software development. It is not a full systematic review and not a product comparison between AI tools. The synthesis connects empirical studies, preprints, industrial cases, and conceptual papers into a practical line of reasoning for development teams.

Research Question

Which workflow principles are associated with productivity, code quality, and human ownership in AI-assisted software development?

Literature Synthesis

Introduction

AI coding assistants have quickly become common in software development. GitHub Copilot, ChatGPT, Claude, Cursor, and their autonomous successors are no longer used only experimentally: they now write more than 20 percent of new code in organizations such as Google and Microsoft, while the Stack Overflow Developer Survey 2025 reports that 84 percent of professional developers use or plan to use AI coding tools (cited in Liu et al., 2026). Yet their effects on productivity, code quality, and professional ownership are far from uniformly positive, and they depend strongly on how developers use AI.

The central thesis of this synthesis is that the question of whether AI helps is less useful than the question of how it is used. The selected empirical and conceptual studies show that there is a coherent set of workflow principles that can enable productivity gains, protect code quality, and preserve human ownership. At the same time, the same studies show that ignoring these principles can lead to hidden time costs, persistent technical debt, and a gradual erosion of professional accountability.

This synthesis is organized thematically. It first discusses the productivity picture, then the five workflow principles: task fit, specification before generation, explicit verification, structural quality assurance, and human ownership. It then treats code review as a critical workflow node and identifies the organizational conditions that shape adoption. The text closes with an integrated, directly usable workflow and the boundaries within which the conclusions are strongest.


Productivity: a nuanced and context-dependent picture

The productivity effects of AI coding assistants are positive on average, but unevenly distributed and partly illusory if the broader picture is ignored. This is the most consistent finding in the literature, and also the most easily misunderstood.

The strongest positive numbers come from the randomized controlled trials (RCTs) reported by Cui et al. (2025), which followed 4,867 developers at Microsoft, Accenture, and a Fortune 100 company: AI assistance led to an average of 26.08 percent more completed tasks, 13.55 percent more commits, and 38.38 percent more compilations. Less experienced developers benefited the most. Similar, though more modest, gains were found by Song et al. (2024) in open-source projects: Copilot use led to a net 5.9 percent increase in project-level code contributions. Mohamed et al. (2025) confirm in their systematic review of 39 studies that the majority of studies report benefits, especially faster routine work and less time spent searching for syntax and boilerplate.

However, the same literature contains a striking counterintuitive finding that tempers the optimism. Becker et al. (2025) conducted a within-subjects RCT with 16 experienced open-source contributors working on 246 tasks in their own familiar projects. Task completion time increased by 19 percent with AI use, while participants themselves had predicted a 24 percent improvement. The authors acknowledge risks of experimental artifacts, but consider the main result robust. This gap between perception and reality has been identified in several studies (Liang et al., 2024; Mozannar et al., 2024), and it creates a serious risk: teams that believe they are working faster when they are not have little basis for critically evaluating their workflow.
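
To make the size of this perception gap concrete, the short calculation below applies both percentages to a single task. The 100-minute baseline is a hypothetical value chosen for readability, and reading the predicted 24 percent improvement as a 24 percent reduction in completion time is an interpretation, not a figure taken from the study.

```python
# Hypothetical baseline: a task that takes 100 minutes without AI (assumed value).
baseline_minutes = 100.0

predicted_change = -0.24  # participants expected to finish 24 percent faster
measured_change = 0.19    # measured result: tasks took 19 percent longer

expected_minutes = baseline_minutes * (1 + predicted_change)
actual_minutes = baseline_minutes * (1 + measured_change)

# How far the measured duration exceeds the duration participants expected.
overrun_ratio = actual_minutes / expected_minutes

print(f"expected: {expected_minutes:.0f} min, actual: {actual_minutes:.0f} min")
print(f"actual time is {overrun_ratio:.2f}x the expected time")
```

Under these assumptions the task takes roughly one and a half times as long as the developer expected, which is why self-reported speed is a weak basis for evaluating a workflow.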

The explanation lies in what Mohamed et al. (2025) call the multidimensionality of productivity, and what Chen et al. (2026) develop empirically through a mixed-methods study at BNY Mellon with 2,989 survey responses and 11 interviews. Chen et al. identify six productivity factors, of which technical expertise and ownership of the work are among the most underestimated. Short-term metrics such as commits and completed tasks do not measure whether a developer actually understands the code, can maintain it, or can defend it with confidence. Fan et al. (2026) add another dimension: verification load. Their experiment with 60 participants shows that AI use reduced working time by an average of 22 percent, but that a verification-load index (failed compilations, time-to-first-compile, code reversals, pauses, and interaction interruptions) accumulated across tasks and partly explained the increase in stress and fatigue.

The boundary of productivity gain is therefore not arbitrary. Becker et al. (2025) show that it shifts with familiarity with the codebase: in mature, complex projects, verification and adaptation costs can outweigh the writing-time gains of generated code. Song et al. (2024) add a collective dimension: AI increases individual contributions, but also increases coordination time by 8 percent. The net effect at project level remains positive according to the authors, but smaller than purely individual measures would suggest. In short: the question is not whether AI produces productivity gains, but for whom, in which project phase, and under which task allocation — and whether the gain is realized without creeping quality and ownership costs.


Workflow Principle 1: task fit

The first and most fundamental conclusion from the literature is that not all tasks are equally suitable for AI assistance, and that consciously matching task type to AI use maximizes productivity gains while limiting quality risk.

Barke et al. (2023) provide the foundation for this principle through a grounded theory analysis of 20 programmers working with GitHub Copilot. They distinguish two qualitatively different interaction modes. In acceleration mode, the developer knows exactly what they want to achieve and uses AI as a speed-up mechanism: suggestions are quickly evaluated and accepted or rejected. In exploration mode, the developer is uncertain about the direction and uses AI to explore options, which can help but can also slow the work down. The crucial point is that each mode requires a different verification strategy: acceleration mode can often rely on quick inspection, while exploration mode requires deeper evaluation.

Sergeyuk et al. (2024) provide empirical insight into task fit based on a survey of 481 JetBrains users. Developers use AI assistants across activities such as new features, tests, bug triage, refactoring, and natural-language artifacts, but concrete tasks such as test generation and docstrings appear more suitable than tasks that require broad project context. This aligns with Li et al. (2024), whose DevEval benchmark shows that current LLMs have significant shortcomings in early SDLC phases such as system design, environment setup, and acceptance testing, while performing best on isolated implementation tasks.

Practice confirms the pattern. Li et al. (2026) analyzed 2,547 ChatGPT conversations from GitHub and found that most interactions consisted of one to three turns: task-focused, narrow in scope, and aimed at code generation, code modification, or problem resolution. Vigh et al. (2026) confirm through interviews with 11 developers that AI use is conditional and selective: developers deliberately avoid AI for security-critical, system-defining, or high-accountability tasks. This is not reluctance; it is professional judgment. Responsibility for correctness and safety was explicitly treated by these developers as non-transferable.

The workflow principle that follows is that task fit must be decided up front, not treated as a vague guideline. Is this an acceleration task or an exploration task? Is the risk level low enough for AI delegation? Is there enough context to evaluate an AI suggestion quickly? Russo (2024) shows that this is not an academic abstraction: in a mixed-methods adoption study (survey of 100 software developers, PLS-SEM validation with 183 respondents), workflow compatibility was the strongest driver of AI adoption, stronger than perceived usefulness or social pressure. Tools that disrupt the existing workflow are not adopted sustainably; tools that fit into it are.


Workflow Principle 2: specification before generation

The second conclusion is that AI code generation performs substantially better when intent, context, and architectural constraints are made explicit before the model starts working. This principle runs against the most common AI interaction pattern, where developers immediately state a task and hope for a usable suggestion.

Mu et al. (2024) provide the clearest empirical demonstration with ClarifyGPT: a framework that detects ambiguities in requirements, asks targeted clarification questions to the user, and only then generates code. In a human-validated evaluation on MBPP-sanitized, GPT-4’s Pass@1 rose from 70.96 percent to 80.80 percent through this explicit clarification step. The reason is simple: LLMs guess when input is unclear, and that guess is wrong more often than users realize.
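
The control flow of such a clarification step can be sketched in a few lines. This is a minimal illustration of the ordering (detect, ask, then generate), not the ClarifyGPT implementation: the function names, the keyword-based detector, and the canned user answer are all hypothetical stand-ins so that the example runs without a model.

```python
from typing import Callable


def clarify_then_generate(
    requirement: str,
    detect_ambiguities: Callable[[str], list[str]],
    ask_user: Callable[[str], str],
    generate: Callable[[str], str],
) -> str:
    """Ask every clarification question first, then generate from the enriched requirement."""
    enriched = requirement
    for question in detect_ambiguities(requirement):
        answer = ask_user(question)
        enriched += f"\nClarification: {question} -> {answer}"
    return generate(enriched)


# Toy stand-ins (all hypothetical) so the sketch is runnable on its own.
def toy_detector(requirement: str) -> list[str]:
    text = requirement.lower()
    if "sort" in text and "ascending" not in text and "descending" not in text:
        return ["Should the result be ascending or descending?"]
    return []


def toy_user(question: str) -> str:
    return "descending"


def toy_generator(prompt: str) -> str:
    return f"# code would be generated from:\n{prompt}"


result = clarify_then_generate(
    "Sort the list of scores.", toy_detector, toy_user, toy_generator
)
print(result)
```

The point of the design is that the generator never sees the raw requirement when the detector has open questions; the answers travel with the prompt instead of being guessed by the model.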

Ullrich et al. (2025) deepen this point through interviews with 18 practitioners from 14 companies. Their finding is that requirements as they are typically documented are too abstract for direct LLM input. Developers must manually decompose them into concrete programming tasks, enriched with design decisions, architectural constraints, and relevant context. This is not a time-consuming detour; it is the core requirements-engineering work that was always necessary, but now also becomes explicit prompt preparation. The implication is clear: ownership of the specification is the most critical human contribution to the AI workflow, and skipping it is likely to degrade the output, because the model then has to guess at exactly the decisions the developer left unstated.

Mallya et al. (2025) add a practical nuance from requirements engineering: LLMs can classify user feedback and generate requirement specifications, but performance remains moderate and requires human validation. AI can therefore be useful as a first filter for raw input from tickets, reviews, or feedback channels, but not as a replacement for the human specification step. Before code is generated, the developer must check whether functional intent, non-functional requirements, and priority have been translated correctly.

Liang et al. (2024) support this from a usability perspective. In their survey of 410 developers, the most common reasons for rejecting AI suggestions were that the output did not satisfy non-functional requirements and that it was hard to steer. Both objections are manifestations of the same problem: too little context and direction in the prompt. As adjacent evidence from broader GenAI literature, Gerlich (2025) shows in a cross-national experiment (n=150) that structured prompting, in which users are forced to formulate their question methodically, improves both AI output quality and critical reasoning while reducing cognitive offloading. Unstructured AI use increases the perception of productivity, but not reasoning quality or output quality.

Sarkar et al. (2024) summarize the principled consequence: AI interfaces must be designed to preserve critical thinking under increasing delegation. Their proposed mechanism is provocations: short textual objections that critique AI output and point to alternatives. Such mechanisms do not arise automatically from the interaction; they must be explicitly designed as interface elements. The primary risk of omitting them is not hallucination, but cognitive delegation: the gradual movement of critical thinking from the developer to the model, without the developer noticing or intending it.


Workflow Principle 3: verification as an explicit workflow step

That AI output requires verification is obvious. What the literature adds, and what practice often underestimates, is that verification has substantial time cost, that its quality can decline cumulatively through fatigue, and that it must be actively designed into the workflow instead of being performed ad hoc.

Mozannar et al. (2024), using the CUPS taxonomy (CodeRec User Programming States), showed how 21 programmers actually spent their time while working with GitHub Copilot. The results are sobering: a large share of interaction time went into reading and evaluating AI suggestions, not writing or problem solving. Productivity gains were therefore smaller than users expected. Fan et al. (2026) quantify this further in a controlled experiment: they introduce the verification-load index, a composite measure of failed compilations, time to first compile, code reversals, and interaction interruptions. The index shows that verification load accumulates across tasks, leading to fatigue and statistically detectable shallower checking.
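
A composite of this kind can be approximated from ordinary session telemetry. The sketch below is one possible construction, not the formula used by Fan et al.: it takes the four signals named above, min-max normalizes each across the tasks in a session, and averages them with equal weights. The class name, field names, normalization, and weighting are all assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class TaskSignals:
    failed_compilations: int
    seconds_to_first_compile: float
    code_reversals: int
    interaction_interruptions: int


def verification_load(tasks: list[TaskSignals]) -> list[float]:
    """Equal-weight mean of min-max normalized signals: one score in [0, 1] per task."""
    if not tasks:
        return []
    names = [f.name for f in fields(TaskSignals)]
    scores = [0.0] * len(tasks)
    for name in names:
        values = [float(getattr(task, name)) for task in tasks]
        low, high = min(values), max(values)
        span = high - low
        for i, value in enumerate(values):
            # A signal that never varies carries no information, so it adds 0.
            scores[i] += (value - low) / span if span else 0.0
    return [score / len(names) for score in scores]


# Toy session (invented numbers) in which every signal rises from task to task.
session = [
    TaskSignals(0, 40.0, 1, 0),
    TaskSignals(2, 90.0, 3, 1),
    TaskSignals(4, 140.0, 5, 2),
]
print(verification_load(session))
```

Because the normalization is relative to the session, the score shows whether load is climbing within a sitting; it does not allow comparison between developers or teams.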

The practical consequence is illustrated most sharply by Becker et al. (2025). Their RCT showed that experienced developers in familiar codebases took 19 percent longer on tasks with AI than without it. The cause was not that the AI was simply bad, but that developers spent significant time evaluating, adapting, and repairing suggestions that were directionally correct but lacked context. This is the mature-project problem: conventions, architectural patterns, technical debt, and implicit knowledge are not fully available to the AI, and the developer must repeatedly compensate for that.

Liu et al. (2026) show what happens when verification is insufficient. In their large-scale analysis of 302,600 AI-generated commits across 6,299 GitHub repositories, they identified 484,366 individual issues, of which 22.7 percent remained in the latest repository version. Code smells were the most prevalent issue type (89.3 percent of all issues). Every AI tool studied introduced at least one issue in more than 15 percent of commits. Agarwal et al. (2026) find similar patterns for autonomous coding agents: static-analysis warnings increased by 18 percent and cognitive complexity by 39 percent, and these effects persisted after the velocity gains of the agents had faded.

The workflow principle that follows is that verification does not happen well by itself, and should not be delayed until pressure is high. Adalsteinsson et al. (2025) show that AI-assisted code review is more effective when review context is explicitly assembled through a RAG pipeline, so the AI output becomes more relevant and the human reviewer has to compensate less. Vijayvergiya et al. (2024) show at Google that AI can work as a first filter for coding conventions, allowing human reviewers to focus on logic and architecture. Fan et al. (2026) recommend adaptive interface selection based on task difficulty: inline AI for simple tasks, chat-based AI for higher complexity, and structured prompts for beginners.

In short: verification must be designed — planned as a step, equipped according to task difficulty and risk, and deliberately bounded in time to limit fatigue.


Workflow Principle 4: structural quality assurance

The fourth conclusion is that AI use makes structural quality assurance through static analysis, tests, and coding standards more necessary, not less. Multiple studies show that AI output consistently introduces quality problems that remain in the codebase without additional QA steps.

Liu et al. (2026) provide the most convincing empirical evidence here. In addition to the high persistence of issues (22.7 percent), they found that code smells were the dominant problem type: a fundamental maintainability risk that appears later as higher change cost. Sun et al. (2026) approach the same issue from another angle: after analyzing 109 scientific papers, conducting two industrial workshops, and running an empirical patch study, they conclude that there is systematic misalignment between academic priorities (security, performance), industrial priorities (maintainability, readability), and actual LLM behavior. Non-functional quality characteristics (NFQCs) are insufficiently addressed in all three domains. Practitioners fear technical-debt accumulation but lack suitable QA mechanisms for AI-generated code.

Yu et al. (2026) make this conclusion concrete at the organizational level. In their multi-case study of three industrial GenAI projects, teams mainly used context-specific, task-oriented quality metrics, because generic research metrics rarely map directly to operational value. For AI-generated code, this means quality assurance cannot stop at a generic test or lint step. Teams must explicitly determine which quality criteria matter for each codebase and task type: maintainability, error handling, performance, security, explainability, or domain-rule alignment.

The idea that prompting solves the problem is attractive, but not supported by the data. Della Porta et al. (2025) studied 7,583 code files from the DevGPT dataset across three quality attributes (maintainability, security, reliability) and three prompt patterns (zero-shot, chain-of-thought, few-shot). A Kruskal-Wallis test found no significant quality differences between the prompt patterns, and this held for code from existing, uncontrolled projects. On this evidence, the choice of prompt pattern does not determine code quality; quality assurance must be organized elsewhere in the workflow.

Static analysis combined with LLMs offers a more promising direction. Patcas and Motogna (2026) evaluated six LLMs on solving SonarQube-reported quality issues and found an average reduction of 36 percent, with the best model (Grok 3) reaching 71.54 percent on one project. Wadhwa et al. (2024) developed CORE, a proposer-ranker LLM pipeline that fixes static-analysis issues and approximates human review criteria; 59.2 percent of Python files passed both the tool and human review, while 76.8 percent of Java files passed static analysis. Berabi et al. (2024) show that targeted AI repair of security vulnerabilities, where program analysis bounds the LLM context, performs better than unguided generation. The common pattern is clear: static analysis guides AI; AI does not replace static analysis. Simões and Venson (2024) confirm that LLM quality evaluation is complementary to static analysis, not a replacement: their comparison of GPT models with SonarQube shows substantial model variability and emphasizes that human judgment remains necessary as the final standard.
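
The pattern of letting static analysis bound what the model sees can be illustrated with a small prompt builder. This is a sketch under stated assumptions, not the method of any of the cited tools: the Finding structure, the rule identifier, the three-line radius, and the prompt wording are hypothetical, and no analyzer or model is actually called.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule_id: str   # identifier reported by the analyzer (hypothetical format)
    message: str   # human-readable description of the issue
    line: int      # 1-based line number of the finding


def build_repair_prompt(source: str, finding: Finding, radius: int = 3) -> str:
    """Return a prompt limited to the lines around one static-analysis finding."""
    lines = source.splitlines()
    start = max(finding.line - 1 - radius, 0)      # 0-based, inclusive
    end = min(finding.line + radius, len(lines))   # 0-based, exclusive
    excerpt = "\n".join(f"{n + 1}: {lines[n]}" for n in range(start, end))
    return (
        "Fix only the issue below and change nothing else.\n"
        f"Rule: {finding.rule_id}\n"
        f"Message: {finding.message}\n"
        f"Code (lines {start + 1}-{end}):\n{excerpt}"
    )


# Toy 20-line source file and a finding on line 10.
example_source = "\n".join(f"line_{i} = {i}" for i in range(1, 21))
prompt = build_repair_prompt(
    example_source, Finding("unused-variable", "Remove unused variable.", 10)
)
print(prompt)
```

Keeping the excerpt small is the design choice that matters here: the model is asked to repair one reported issue in a bounded window, and the analyzer can be rerun afterwards to confirm that the finding is gone.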

Security-specific quality assurance is even more urgent. Tessa et al. (2026) conducted an adversarial audit of three state-of-the-art secure code generation methods and found that static analyzers overestimated security by 7 to 21 times, that 37 to 60 percent of outputs classified as “secure” were non-functional, and that under adversarial conditions, actually secure-and-functional output dropped to 3 to 17 percent. Security assurance cannot be delegated to an AI model, even to models designed for that purpose. Tony et al. (2024) confirm from a complementary angle that structured prompting techniques, especially Recursive Criticism and Improvement (RCI), reduce security weaknesses, but external security validation remains necessary.

Haroon et al. (2026) complete the quality picture for test generation. LLMs initially reach 79 percent line coverage, but under semantics-altering changes (SAC), the pass rate of newly generated tests falls to 66 percent and branch coverage to 60 percent. Tests are shallowly coupled to the original code and respond to lexical rather than semantic changes. AI-generated test suites therefore require human re-evaluation after every significant code change to preserve regression coverage.

The workflow principle is that quality assurance is a structural and multidimensional workflow step, not an optional add-on. Concretely: run static analysis such as SonarQube or CodeQL on AI-generated code before review; validate security-sensitive code with a dedicated tool in addition to the LLM; periodically reassess AI-generated test suites after code changes; and explicitly state NFQCs such as maintainability, readability, and performance as acceptance criteria.
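
These checks can be combined into a simple gate that runs before human review. The sketch below is illustrative only: it assumes the results of the analyzer, the security tool, and the test run have already been collected into a report, and the field names and the zero-tolerance threshold for open issues are assumptions a team would tune.

```python
from dataclasses import dataclass, field


@dataclass
class QualityReport:
    static_analysis_issues: int
    security_scan_passed: bool
    tests_passed: bool
    unmet_criteria: list[str] = field(default_factory=list)  # e.g. "readability"


def gate(report: QualityReport, security_sensitive: bool) -> list[str]:
    """Return blocking reasons; an empty list means the change may go to review."""
    reasons = []
    if report.static_analysis_issues > 0:
        reasons.append(f"{report.static_analysis_issues} static-analysis issue(s) open")
    if security_sensitive and not report.security_scan_passed:
        reasons.append("dedicated security validation failed or missing")
    if not report.tests_passed:
        reasons.append("tests failing")
    reasons.extend(f"acceptance criterion not met: {c}" for c in report.unmet_criteria)
    return reasons


report = QualityReport(
    static_analysis_issues=2,
    security_scan_passed=False,
    tests_passed=True,
    unmet_criteria=["readability"],
)
for reason in gate(report, security_sensitive=True):
    print("BLOCKED:", reason)
```

Returning reasons instead of a single boolean keeps the gate explainable: the author sees which quality dimension failed, which matches the requirement that acceptance criteria be stated explicitly.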


Workflow Principle 5: human ownership

The fifth conclusion is that human ownership in AI-assisted software development is not automatic; it must be actively preserved. Without explicit attention, the mechanisms that carry ownership — personal standards, professional integrity, critical thinking, and social reciprocity — erode under AI delegation.

Alami and Ernst (2024) provide the conceptual basis for this principle in their interview study with 12 software developers. They distinguish two forms of accountability: institutional accountability (formal processes, measurable KPIs) and grassroots accountability (intrinsic motivation, reputation, peer norms). Grassroots accountability cannot be transferred to a system: it rests on reciprocity, pride in quality, and expectations from peers. Alami et al. (2025) show in a follow-up study (16 interviews and focus groups with AI-review simulations) that LLM-assisted code review disrupts precisely this reciprocity: developers do not feel accountable to an AI in the way they feel accountable to a colleague. This directly affects the quality standards they apply to AI-generated or AI-reviewed code.

Sarkar et al. (2024) describe the structural risk of increasing AI delegation as “copilot-to-autopilot drift”: the gradual movement of critical thinking from the developer to the model, without a conscious decision. They argue that hallucinations are less dangerous than this creeping loss of cognitive engagement, and propose provocations as interface-level mechanisms that force users to think before accepting AI output. Gerlich (2025) supports this from broader GenAI literature: structured AI use, where users are forced to formulate questions methodically, significantly reduces cognitive offloading and increases critical reasoning compared with unstructured use.

Developers intuitively recognize this boundary. Vigh et al. (2026) show that developers use AI selectively and deliberately avoid it for high-accountability tasks such as security-critical work, system definition, and architecture decisions. Responsibility for correctness and safety was treated by nearly all participants as non-transferable. This is not reluctance caused by unfamiliarity with AI; it is professional judgment.

Ogenrwot and Businge (2026) provide quantitative evidence that developers also enact this ownership in practice: in their analysis of 338 GitHub pull requests with self-disclosed ChatGPT use, the median integration rate of AI patches was only 25 percent. Most integrated patches were selectively modified, extracted, or iteratively refined. Full adoption was the exception. Watanabe et al. (2026) find similar patterns for autonomous coding agents (Claude Code): 83.8 percent of agent PRs were merged, but 45.1 percent required human revision, especially for bugs, documentation style, and project-specific standards.
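
A team that wants to track a comparable number for its own pull requests could start from a rough proxy such as the one below. The definition used here (the share of suggested lines that reappear verbatim in the merged code) is an assumption and not necessarily the measure used in the study, and the three pull requests are invented toy data.

```python
from statistics import median


def integration_rate(suggested_lines: list[str], merged_lines: list[str]) -> float:
    """Share of non-blank suggested lines that appear verbatim in the merged code."""
    suggested = [line.strip() for line in suggested_lines if line.strip()]
    if not suggested:
        return 0.0
    merged = {line.strip() for line in merged_lines}
    return sum(line in merged for line in suggested) / len(suggested)


# (suggested by the AI, actually merged) for three hypothetical pull requests.
pull_requests = [
    (["a = 1", "b = 2", "c = 3", "d = 4"], ["a = 1", "b = 2", "x = 9"]),  # 2 of 4 kept
    (["return x", "return y"], ["return x", "return y"]),                # fully kept
    (["import os", "print(os.getcwd())"], ["import sys"]),               # nothing kept
]
rates = [integration_rate(suggested, merged) for suggested, merged in pull_requests]
print(rates, median(rates))
```

Verbatim matching undercounts patches that were modified before merging, which the study describes as the common case, so a proxy like this is a lower bound on how much AI output actually shaped the result.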

The workflow principle is twofold. First, developers must consciously set boundaries on AI delegation based on accountability and risk. Second, workflow infrastructure must support those boundaries: iterative human judgment should be the norm, not the exception; peer review should remain the social quality and ownership gate, supported but not replaced by AI review; and teams should cultivate critical engagement with AI output rather than rapid acceptance.


Code review as a critical node

Code review deserves separate attention because it is the most studied and most loaded node in the AI-assisted development workflow. It is the moment where productivity, code quality, and ownership meet, and where the tension between AI assistance and human accountability is most visible.

Bacchelli and Bird (2013) provide the baseline for understanding all AI review interventions. In their empirical study at Microsoft (17 developers, 16 teams, 873 programmers), modern code review was shown to be much more than defect detection: it is a primary mechanism for knowledge transfer, team-led quality standards, awareness of each other’s code, and shared ownership. This social function cannot be replaced by an automated system, regardless of accuracy. A review by a colleague carries value beyond the comments placed; it also carries the reciprocal responsibility that Alami et al. (2025) identify as central to professional accountability.

This does not exclude AI assistance in code review, but it limits the role AI should play. Vijayvergiya et al. (2024) show at Google that AI (AutoCommenter) can function effectively as a first filter for coding-practice deviations, allowing human reviewers to focus on logic, semantics, and architecture. Adalsteinsson et al. (2025) show at WirelessCar that a RAG-based AI review prototype, which retrieves relevant contextual information before review, was received more positively than context-free AI review, and that preference for AI review depended on the reviewer’s familiarity with the codebase. Wadhwa et al. (2024) developed CORE, a pipeline where static analysis guides the LLM and a ranker LLM approximates human acceptance criteria. Peng et al. (2025) achieved an 84 percent acceptance rate in production at Huawei through a mixture-of-prompts architecture for security-focused review.

Cihan et al. (2025) align with this through their proposal for human-in-the-loop LLM code review: AI review as support for knowledge sharing and signaling, not as final judgment. Lin et al. (2024) add a training perspective: the quality of AI review comments depends on the quality of training data, and human expertise remains the normative standard. Experience-aware training of review models significantly improves output without additional data collection.

The workflow implication is that AI in code review always has an advisory role, never a deciding role. The reviewer retains final judgment, and the social practice of peer review, with its reciprocal accountability, remains intact. AI review can lower the cost of identifying technical issues such as style, security, and conventions, but it cannot take over the social quality function of human review. It can make human review more efficient: fewer trivial comments, more attention for substantive judgment.
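
Expressed as a merge rule, the advisory role means that AI reviews are recorded but never counted toward approval. The sketch below is a minimal illustration with hypothetical names; real branch-protection settings would be configured in the hosting platform, not in application code.

```python
from dataclasses import dataclass


@dataclass
class Review:
    reviewer: str
    is_human: bool
    approved: bool


def may_merge(reviews: list[Review]) -> bool:
    """A merge needs at least one human approval; AI reviews are advisory only."""
    return any(review.is_human and review.approved for review in reviews)


def advisory_notes(reviews: list[Review]) -> list[str]:
    """AI objections are surfaced to the human reviewer instead of blocking the merge."""
    return [r.reviewer for r in reviews if not r.is_human and not r.approved]


only_ai = [Review("ai-reviewer", is_human=False, approved=True)]
mixed = [
    Review("ai-reviewer", is_human=False, approved=False),
    Review("dana", is_human=True, approved=True),
]
print(may_merge(only_ai), may_merge(mixed), advisory_notes(mixed))
```

In this sketch an AI approval alone never unlocks a merge, and an AI objection never blocks one by itself; it is passed to the human reviewer, who retains final judgment.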


Adoption and organizational conditions

The five workflow principles work only when the organizational context supports them. Three factors are decisive: workflow compatibility, trust and social support, and professional differentiation.

Russo (2024) identified workflow compatibility as the strongest adoption factor in a mixed-methods adoption study using the HACAF framework, stronger than perceived usefulness or social influence. AI tools that fit smoothly into existing work processes are adopted structurally; tools that disrupt the workflow are avoided, even if they are objectively strong. This has direct implications for introducing AI workflows in teams: the workflow must be integrated into existing practices and expanded gradually, not imposed as a total redesign.

Banh et al. (2025), through a sociotechnical adoption framework based on 18 software engineering expert interviews, emphasize that successful GenAI integration requires organizational alignment: clear governance over when and how AI is used, policy that supports trust and experimentation, and incremental integration that leaves room for adjustment. Sergeyuk et al. (2024) confirm that trust and company policy are two of the most frequently mentioned barriers to AI use, and that the lack of project-wide context in AI assistants is a third barrier, this one technical, that shapes tool selection.

Shao and Ishengoma (2026) identify an important professional differentiation effect: professionals systematically apply refinement strategies to AI output (maintainability, architectural alignment), while students focus mainly on a working end product. This difference in ownership culture matters for teams adopting AI: without explicit norm-setting around quality refinement as an expected practice, the team drifts toward the lowest standard. Yu et al. (2026) confirm this in a multi-case study: the quality of GenAI applications in practice varies strongly and depends on organizational embedding and quality assurance. Hao et al. (2024) add a transparency perspective: public sharing of ChatGPT conversations in PRs and issues, a practice they analyzed across 580 GitHub threads, functions as a traceability artifact that makes AI use visible and reviewable.
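
A team can make this kind of traceability checkable by scanning pull request descriptions for shared conversation links. The sketch below assumes two URL shapes for shared ChatGPT conversations; the pattern, the function name, and the placeholder link are illustrative and should be adapted to the tools a team actually uses.

```python
import re

# Assumed URL shapes for publicly shared conversations (adjust as needed).
SHARE_LINK = re.compile(
    r"https://(?:chat\.openai\.com|chatgpt\.com)/share/[0-9a-fA-F-]+"
)


def shared_conversations(text: str) -> list[str]:
    """Return every shared-conversation link found in a PR or issue description."""
    return SHARE_LINK.findall(text)


# Placeholder description with a dummy identifier.
description = (
    "Refactored the parser. Prompt history: "
    "https://chatgpt.com/share/00000000-0000-0000-0000-000000000000 "
    "See also the design note in the wiki."
)
links = shared_conversations(description)
print(links)
```

A check like this only establishes that a link was disclosed; whether the linked conversation explains the change well enough to review remains a human judgment.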


Integrated conclusion

The studies converge on a coherent core: AI coding assistance produces demonstrable benefits, but only when used as a consciously governed, multi-step workflow — not as an ad hoc inspiration source or task outsourcing mechanism. The central paradox in the literature is that the largest productivity risks occur when AI appears to work best: fast suggestions that are accepted immediately can silently accumulate code smells, security problems, and ownership damage. The strongest protection is not less AI, but better workflow.

The five core principles that recur across the literature — task fit, specification first, explicit verification, structural quality assurance, and conscious ownership — are not independent tips but a connected system. Each principle strengthens the others. Good specification makes verification more efficient. Explicit verification protects ownership. Structural quality assurance makes task fit operational. Conscious ownership defines the boundaries within which the other principles operate. The seven steps below operationalize these principles for daily work.

The seven steps

The workflow below is based on the weighted evidence from the selected empirical and conceptual studies. It is intended as a daily working guide, not an idealized picture.


Step 1 — Task fit: decide before you start

Ask yourself two questions before opening an AI tool:

  1. What kind of task is this?
    • Acceleration task: I know what I want; AI helps me move faster — direct AI input is appropriate.
    • Exploration task: I am unsure about direction — use AI to generate options, but evaluate more critically.
  2. What is the risk level?
    • Low: boilerplate, test generation, docstrings, refactoring isolated functions — AI delegation is relatively safe.
    • High: security-critical code, architecture decisions, compliance-related logic, system definition — limit AI use; human judgment is non-transferable.

Evidence: Barke et al. (2023), Vigh et al. (2026), Li et al. (2026), Sergeyuk et al. (2024), Li et al. (2024).
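
The two questions above can be captured as a small triage function, for example in a team checklist or pull request template. The enum names and the wording of the recommendations are illustrative assumptions; the rule that high risk overrides task type follows the list above.

```python
from enum import Enum


class Mode(Enum):
    ACCELERATION = "acceleration"   # the developer knows what they want
    EXPLORATION = "exploration"     # the developer is unsure about direction


class Risk(Enum):
    LOW = "low"     # e.g. boilerplate, docstrings, isolated refactoring
    HIGH = "high"   # e.g. security-critical code, architecture decisions


def triage(mode: Mode, risk: Risk) -> str:
    """Map task type and risk level to a recommended way of using AI."""
    if risk is Risk.HIGH:
        # Risk is checked first: accountability outranks convenience.
        return "limit AI use; a human makes and owns the decision"
    if mode is Mode.ACCELERATION:
        return "direct AI input; quick inspection of each suggestion"
    return "use AI to generate options; evaluate each one in depth"


for mode in Mode:
    for risk in Risk:
        print(f"{mode.value:12} {risk.value:4} -> {triage(mode, risk)}")
```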


Step 2 — Specification: make intent and context explicit

Formulate before prompting:

For more complex requirements, ask the AI for clarification before asking for code: “Do you understand the intent? Ask questions before generating.”

If the input comes from tickets, user feedback, or support channels, use AI at most to cluster themes or draft an initial requirement specification. Then verify for yourself that functional intent, non-functional requirements (NFRs), and priority are correct before asking AI for implementation.

Evidence: Mu et al. (2024), Ullrich et al. (2025), Mallya et al. (2025), Liang et al. (2024), Gerlich (2025), Sarkar et al. (2024).
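A specification can be made explicit as a small structure that refuses to produce a prompt while parts are still empty. This is a minimal Python sketch under assumptions: TaskSpec, gaps, and build_prompt are hypothetical names, and the four fields (intent, context, NFRs, priority) are taken from the wording of this step rather than from a fixed template in the literature.

```python
from dataclasses import dataclass, field

# Clarification request quoted from this step.
CLARIFY_FIRST = "Do you understand the intent? Ask questions before generating."


@dataclass
class TaskSpec:
    intent: str
    context: str
    nfrs: list[str] = field(default_factory=list)
    priority: str = ""

    def gaps(self) -> list[str]:
        """Names of fields that are still empty and need human input."""
        values = {
            "intent": self.intent,
            "context": self.context,
            "nfrs": self.nfrs,
            "priority": self.priority,
        }
        return [name for name, value in values.items() if not value]


def build_prompt(spec: TaskSpec, complex_requirement: bool = False) -> str:
    """Build a prompt only from a complete, human-validated specification."""
    gaps = spec.gaps()
    if gaps:
        raise ValueError("Specification incomplete: " + ", ".join(gaps))
    lines = [
        f"Intent: {spec.intent}",
        f"Context: {spec.context}",
        "Non-functional requirements: " + "; ".join(spec.nfrs),
        f"Priority: {spec.priority}",
    ]
    if complex_requirement:
        # For complex requirements, ask for questions before asking for code.
        lines.append(CLARIFY_FIRST)
    return "\n".join(lines)
```

Raising an error on an incomplete specification keeps the human validation step in front of generation instead of leaving the gaps for the model to fill.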


Step 3 — Generation: steer, do not trust blindly

Generate the code, but position the AI output as a draft, not final code.

Evidence: Li et al. (2026), Becker et al. (2025), Ogenrwot & Businge (2026).


Step 4 — Verification: structured and time-bounded

Verify AI output systematically:

Time-box verification. Set a time limit per session. Verification load increases with fatigue; take a break when you notice your checks becoming more superficial.

Evidence: Mozannar et al. (2024), Fan et al. (2026), Becker et al. (2025), Adalsteinsson et al. (2025).
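Time-boxing can be supported with a very small helper. The Python sketch below is an illustration, not a tool from the cited studies: VerificationTimebox is a hypothetical name, and any concrete limit (such as the 25 minutes in the usage note) is an assumption, since the literature summarized here does not prescribe a number.

```python
import time
from typing import Callable


class VerificationTimebox:
    """Track how long a verification session has been running."""

    def __init__(self, limit_minutes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit_seconds = limit_minutes * 60
        self._clock = clock  # injectable so the timebox can be tested
        self._start = clock()

    def elapsed_seconds(self) -> float:
        return self._clock() - self._start

    def expired(self) -> bool:
        """True once the session has reached its time limit."""
        return self.elapsed_seconds() >= self._limit_seconds
```

A session could start with box = VerificationTimebox(25) and check box.expired() between review items; when it returns True, stop, take a break, and resume in a new session.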


Step 5 — Quality assurance: always after generation

Always perform quality assurance after AI generation, regardless of how convincing the output looks:

Evidence: Liu et al. (2026), Agarwal et al. (2026), Sun et al. (2026), Patcas & Motogna (2026), Haroon et al. (2026), Tessa et al. (2026).
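One way to make such a quality gate concrete is an automated static check that runs on every AI-generated change. The Python sketch below uses the standard-library ast module to flag bare except clauses; the function name bare_excepts and the choice of this particular rule are assumptions for illustration, and a real project would normally rely on an established linter with a broader rule set.

```python
import ast


def bare_excepts(source: str) -> list[int]:
    """Return the line numbers of bare `except:` clauses in Python source."""
    tree = ast.parse(source)  # raises SyntaxError if the draft does not parse
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )
```

An empty result means only that this one rule passed; it says nothing about the other quality criteria for the task.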


Step 6 — Code review: AI as first filter, humans as final gate

In the review phase, the roles are:

Author:

Reviewer:

Evidence: Bacchelli & Bird (2013), Alami et al. (2025), Vijayvergiya et al. (2024), Adalsteinsson et al. (2025), Cihan et al. (2025), Wadhwa et al. (2024).


Step 7 — Reflection: keep ownership active

At the individual level:

At team level:

Evidence: Sarkar et al. (2024), Gerlich (2025), Alami & Ernst (2024), Alami et al. (2025), Chen et al. (2026), Shao & Ishengoma (2026).


Practical Definition of Done

For daily team use, the workflow can be summarized as a minimal Definition of Done for AI-assisted work. An AI contribution is only done when:

This definition makes the synthesis practically applicable without losing the nuance of the literature: AI may speed things up, but only inside a workflow where specification, verification, quality criteria, and human responsibility remain explicit.
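A Definition of Done of this kind can be encoded as an explicit checklist. The Python sketch below is hypothetical: the four flags mirror the elements named in the preceding paragraph (specification, verification, quality criteria, and human responsibility) and are an assumption, not the team's actual criteria.

```python
from dataclasses import dataclass, fields


@dataclass
class AiContribution:
    # Each flag is set by a human once the corresponding check is complete.
    specification_explicit: bool = False
    output_verified: bool = False
    quality_criteria_met: bool = False
    human_owner_named: bool = False


def open_items(contribution: AiContribution) -> list[str]:
    """List the checks that are still open, in declaration order."""
    return [f.name for f in fields(contribution)
            if not getattr(contribution, f.name)]


def is_done(contribution: AiContribution) -> bool:
    """An AI contribution is done only when no check is open."""
    return not open_items(contribution)
```

Defaulting every flag to False means a contribution starts as not done and each item has to be confirmed actively.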


Boundaries of the conclusion

Every synthesis requires honesty about the boundaries of its conclusions. Three boundary conditions are relevant:

1. Experience level shapes the productivity balance. Less experienced developers benefit more strongly from AI (Cui et al., 2025), while experienced developers in familiar codebases can end up with a net time loss (Becker et al., 2025). Step 1 (task fit) and Step 4 (verification time) are especially critical for experienced developers.

2. Project phase, task type, and maturity. In repositories where agents are the first observable AI tool, Agarwal et al. (2026) find mainly short-term velocity gains; in environments with prior AI tooling, this effect is weaker. In mature, large codebases with complex dependencies, quality risks and verification costs are highest. Bounded implementation tasks benefit most readily; legacy integration and architecture work benefit least (Li et al., 2024; Becker et al., 2025). Autonomous agents require the strongest quality assurance in mature projects.

3. Organizational context. Without governance, trust, and an explicit quality culture, individual workflow principles are insufficient (Russo, 2024; Banh et al., 2025). The workflow works best in a team that shares norms about AI use, quality standards, and ownership.


Summary: the seven steps at a glance

Step | Principle | Core finding | Key sources
1 | Task fit | Not all tasks are suitable for AI; selectivity is professional practice | Barke (2023), Vigh (2026)
2 | Specification first | Making context and intent explicit substantially improves quality | Mu (2024), Ullrich (2025), Mallya (2025)
3 | Steered generation | AI output is a draft; short, task-focused sessions are more effective | Li (2026), Becker (2025)
4 | Explicit verification | Verification costs more time than expected; run it structurally and time-bound | Mozannar (2024), Fan (2026)
5 | Quality assurance | Static analysis and task-specific quality criteria are required after AI generation | Liu (2026), Sun (2026), Yu (2026)
6 | Code review: human final gate | AI filters conventions; peer review remains essential for ownership and quality norms | Bacchelli & Bird (2013), Alami (2025)
7 | Active ownership | Understanding, explainability, and selectivity are the standards of professional AI use | Sarkar (2024), Ogenrwot & Businge (2026)

Sources Used and Role in the Synthesis

The table below shows which sources carry the main argument and which are mainly used for context or practice-based evidence.

Role in the synthesis | Sources
Productivity, adoption, and organizational context | Cui et al. (2025), Becker et al. (2025), Song et al. (2024), Mohamed et al. (2025), Chen et al. (2026), Russo (2024), Banh et al. (2025), Shao & Ishengoma (2026), Yu et al. (2026)
Task fit and interaction modes | Barke et al. (2023), Sergeyuk et al. (2024), Li et al. (2024), Li et al. (2026), Vigh et al. (2026)
Specification, prompting, and requirements | Mu et al. (2024), Ullrich et al. (2025), Mallya et al. (2025), Liang et al. (2024), Gerlich (2025), Sarkar et al. (2024), Tony et al. (2024)
Verification, quality assurance, and technical debt | Mozannar et al. (2024), Fan et al. (2026), Liu et al. (2026), Agarwal et al. (2026), Sun et al. (2026), Patcas & Motogna (2026), Haroon et al. (2026), Tessa et al. (2026), Berabi et al. (2024), Wadhwa et al. (2024), Simões & Venson (2024), Della Porta et al. (2025)
Code review, accountability, and ownership | Bacchelli & Bird (2013), Alami & Ernst (2024), Alami et al. (2025), Vijayvergiya et al. (2024), Adalsteinsson et al. (2025), Cihan et al. (2025), Lin et al. (2024), Peng et al. (2025)
Traceability and real-world AI use | Hao et al. (2024), Ogenrwot & Businge (2026), Watanabe et al. (2026)

References

Adalsteinsson, G. H., et al. (2025). Rethinking code review workflows with LLM assistance. Proceedings of the 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 488-497. https://doi.org/10.1109/ESEM64174.2025.00013

Agarwal, S., et al. (2026). AI IDEs or autonomous agents? Measuring the impact of coding agents. Proceedings of MSR 2026. https://doi.org/10.1145/3793302.3793589

Alami, A., & Ernst, N. A. (2024). Understanding the building blocks of accountability in software engineering. Empirical Software Engineering.

Alami, A., et al. (2025). Accountability in code review: The role of intrinsic drivers and the impact of LLMs. ACM Transactions on Software Engineering and Methodology, 34(8), 1-44. https://doi.org/10.1145/3721127

Bacchelli, A., & Bird, C. (2013). Expectations, outcomes, and challenges of modern code review. Proceedings of the 35th International Conference on Software Engineering (ICSE), 712-721.

Banh, L., et al. (2025). Copiloting the future: How GenAI transforms software engineering. Information and Software Technology, 183, 107751. https://doi.org/10.1016/j.infsof.2025.107751

Barke, S., James, M. B., & Polikarpova, N. (2023). Grounded Copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages, 7(OOPSLA1), 85-111.

Becker, J., et al. (2025). Measuring the impact of early-2025 AI on experienced OSS developer productivity. METR Technical Report. https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf

Berabi, B., et al. (2024). DeepCode AI Fix: Fixing security vulnerabilities with large language models. ICML 2024.

Chen, V., et al. (2026). Beyond the commit: Developer perspectives on productivity with AI coding assistants. Proceedings of ICSE-SEIP 2026. https://doi.org/10.1145/3786583.3786848

Cihan, T., et al. (2025). Evaluating large language models for code review. arXiv:2505.20206.

Cui, Z., et al. (2025). The effects of generative AI on high-skilled work: Evidence from three field experiments with software developers. SSRN Working Paper, No. 4945566. https://doi.org/10.2139/ssrn.4945566

Della Porta, J., et al. (2025). Do prompt patterns affect code quality? Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, 181-192. https://doi.org/10.1145/3756681.3756938

Fan, G., et al. (2026). When help hurts: Verification load and fatigue with AI coding assistants. Proceedings of CHI 2026. https://doi.org/10.1145/3772318.3791176

Gerlich, M. (2025). From offloading to engagement: An experimental study on structured prompting and critical reasoning with generative AI. Data, 10, 172. https://doi.org/10.3390/data10110172

Hao, Y., et al. (2024). Developers’ shared conversations with ChatGPT in GitHub pull requests and issues. Empirical Software Engineering, 29(6). https://doi.org/10.1007/s10664-024-10540-x

Haroon, S., Khan, M. T., & Gulzar, M. A. (2026). Evaluating LLM-based test generation under software evolution. arXiv:2603.23443.

Li, M., et al. (2024). Prompting LLMs to tackle the full SDLC: DevEval. arXiv preprint.

Li, R., et al. (2026). Unveiling the role of ChatGPT in software development: Insights from developer-ChatGPT interactions on GitHub. ACM Transactions on Software Engineering and Methodology.

Liang, J., et al. (2024). Usability of AI programming assistants: Successes and challenges. Proceedings of ICSE 2024, 1-13. https://doi.org/10.1145/3597503.3608128

Lin, B., et al. (2024). Improving automated code reviews: Learning from experience. Proceedings of MSR 2024, 278-283. https://doi.org/10.1145/3643991.3644910

Liu, Y., et al. (2026). Debt behind the AI boom: A large-scale empirical study of AI-generated code in the wild. arXiv:2603.28592.

Mallya, M. A., Ferrari, A., Zadenoori, M. A., & Dąbrowski, J. (2025). From online user feedback to requirements: Evaluating large language models for classification and specification tasks. arXiv:2510.23055.

Mohamed, A., et al. (2025). The impact of LLM-assistants on software developer productivity: A systematic review and mapping study. arXiv:2507.03156.

Mozannar, H., et al. (2024). Reading between the lines: Modeling user behavior and costs in AI-assisted programming. Proceedings of CHI 2024.

Mu, F., et al. (2024). ClarifyGPT: A framework for enhancing LLM-based code generation via requirements clarification. Proceedings of the ACM on Software Engineering, 1(FSE), 2332-2354. https://doi.org/10.1145/3660810

Ogenrwot, D., & Businge, J. (2026). PatchTrack: A comprehensive analysis of ChatGPT’s influence on pull request outcomes. Empirical Software Engineering, 31(5). https://doi.org/10.1007/s10664-026-10869-5

Patcas, R., & Motogna, S. (2026). An evaluation study of large language models for addressing code quality issues. Empirical Software Engineering, 31, 118. https://doi.org/10.1007/s10664-026-10858-8

Peng, Y., Kim, K., Meng, L., & Liu, K. (2025). iCodeReviewer: Improving secure code review with mixture of prompts. arXiv:2510.12186.

Russo, D. (2024). Navigating the complexity of generative AI adoption in software engineering. ACM Transactions on Software Engineering and Methodology.

Sarkar, A., et al. (2024). AI should challenge, not obey. Communications of the ACM, 67(10), 18-21. https://doi.org/10.1145/3673413 (Preprint: arXiv:2412.15030)

Sergeyuk, A., et al. (2024). Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward. Information and Software Technology. https://doi.org/10.1016/j.infsof.2024.107610

Shao, D., & Ishengoma, F. (2026). Empirical analysis of generative AI tool adoption in software development. Information and Software Technology, 192, 108036.

Simões, M., & Venson, E. (2024). Evaluating source code quality with large language models. arXiv preprint.

Song, Y., et al. (2024). The impact of generative AI on collaborative OSS development. arXiv:2410.02091.

Sun, X., et al. (2026). Quality assurance of LLM-generated code: Addressing non-functional quality characteristics. Journal of Systems and Software, 238, 112885. https://doi.org/10.1016/j.jss.2026.112885

Tessa, M., et al. (2026). How secure is secure code generation? Adversarial prompts put LLM defenses to the test. arXiv:2601.07084.

Tony, C., et al. (2024). Prompting techniques for secure code generation. arXiv preprint.

Ullrich, J., et al. (2025). From requirements to code: Developer practices in LLM-assisted software engineering. Proceedings of IEEE RE 2025. https://doi.org/10.1109/RE63999.2025.00032

Vijayvergiya, M., et al. (2024). AI-assisted assessment of coding practices in modern code review. Proceedings of the 1st ACM International Conference on AI-Powered Software (AIware ’24), 85-93. https://doi.org/10.1145/3664646.3665664

Vigh, E., Sunesen, F., & Barkhuus, L. (2026). “AI does not understand the real world”: AI augmented software development. CHI EA ’26.

Wadhwa, N., et al. (2024). CORE: Resolving code quality issues using LLMs. Proceedings of the ACM on Software Engineering, 1(FSE), 789-811. https://doi.org/10.1145/3643762

Watanabe, M., et al. (2026). On the use of agentic coding: An empirical study of pull requests on GitHub. ACM Transactions on Software Engineering and Methodology.

Yu, L., et al. (2026). Evaluating the quality of GenAI applications in software engineering: A multi-case study. Empirical Software Engineering, 31, 29. https://doi.org/10.1007/s10664-025-10759-2


AI Statement

AI was used to support the structuring of this literature synthesis, comparison of sources, consistency checking between source use and argumentation, and sharpening of wording. The content choices, source weighting, interpretation, final editing, and responsibility for the final text remain with the author.