Reading time: 10 min

The Compliance Cost of 10x AI Engineering

Written by Daniella Mitchell, Apr 27, 2026

A VP of Engineering at a European insurer told me last quarter that her team had doubled commit velocity since rolling out Cursor. Then she told me their release cadence had not changed. Same monthly release window, same number of features in production. The doubled output was sitting in a queue waiting for security and compliance review.

She asked me whether the AI investment was a waste. I told her it was not, but the bottleneck had moved, and nobody on her team had moved with it.

This article is about that bottleneck. The marketing pitch for AI coding agents sounds the same regardless of industry: 10x productivity, faster shipping, smaller teams. In regulated environments, that pitch crashes into the structural reality of how regulated software actually ships, which is to say slowly and through several layers of human judgment that AI does not replace.

I want to walk through what actually changes when you put a coding agent into a bank, an insurer, or a pharma R&D org, and what the operating model has to do to absorb it.

The 10x number and where it comes from

The 10x figure comes from a few specific sources. GitHub published research in 2022 showing 55% faster task completion for Copilot users on a controlled coding benchmark. That number gets cited everywhere. What gets cited a lot less is the METR randomized controlled trial from July 2025, which measured experienced open-source developers using early-2025 AI tools on their own repositories and found that the tools made them 19% slower. The developers themselves believed they were 20% faster, while the clock disagreed.

I am not citing METR to argue AI tools are useless. I am citing it to ground the conversation. The productivity range is real but wider and more contingent than the marketing suggests. In environments where the codebase is unfamiliar to the AI, the task is non-routine, and the cost of being wrong is high, the gains shrink fast or invert.

Neither of those studies measured regulated-industry workflows. They measured experienced developers in environments without change advisory boards, validation suites, or external auditors. The throughput multiplier is real for some slices of work; those slices are just a small part of what happens in regulated software.

When the same coding agent is dropped into, say, a bank's payments platform, three things happen at once. Code production speeds up roughly as advertised. Code review gets faster on the mechanical parts and slower on the judgment parts (more code per PR means more cognitive load per review). And every downstream control (SAST, DAST, secret scanning, license check, change ticket, security review, regression suite, release sign-off) sees more volume per unit time.

The downstream stack is what determines actual throughput. If that stack does not scale, the AI investment piles up in queues.

Where the new bottlenecks actually appear

Across the engagements I have seen in the past year, four bottlenecks recur.

Security review queue. Most regulated orgs have a centralized AppSec team that reviews changes touching authentication, authorization, data handling, or external interfaces. That team is small (5-10 people for a 500-engineer org is typical). When commit volume doubles, AppSec review wait times double. Engineers context-switch, PRs go stale, the velocity gain evaporates.

Change Advisory Board (CAB) latency. Banks and insurers run CAB meetings weekly or biweekly. The meeting agenda has a fixed slot count. More changes means more deferrals, not faster meetings.

Regression coverage on safety paths. In pharma, any change to GxP-scoped code triggers a validation re-run under 21 CFR Part 11. In banking, changes to settlement, KYC, or AML logic trigger a regression suite that takes hours. AI doubles the rate of changes that hit these paths but does not speed up the validation. That is straightforwardly more total wait time per feature. (We covered the EU regulatory layer over agentic systems in our EU AI Act compliance checklist.)

Documentation debt. SOC 2, DORA, FDA submissions, ISO 27001 audits, all of them require written evidence that changes were tested, reviewed, and approved. AI-generated code does not write its own change ticket. The doc burden grows linearly with throughput, and somebody has to absorb it. Usually that somebody is the same senior engineer who is now reviewing twice the code.

The pattern is universal: AI accelerates the parts of the SDLC that were never the constraint, while the actual constraint sits untouched. This is Goldratt's Theory of Constraints applied to AI tooling, and it is shocking how rarely the conversation gets framed that way.

The three moves that fix it

I have seen three operational changes work, used in combination.

The first is automating the mechanical parts of review. License scanning (FOSSA, ScanCode), secret detection (Gitleaks, TruffleHog), schema-change detection (custom tooling, usually), AI-suggested test coverage. None of these replace human judgment. They take the parts of review that looked like human judgment but were really pattern-matching, and move them into CI. AppSec reviewers stop spending their day flagging hardcoded API keys and start spending it on actual threat-model questions.
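As an illustration of the kind of check that moves from a human reviewer into CI, here is a toy sketch of a hardcoded-credential scan over a unified diff. It is deliberately simplified and not a replacement for a real scanner like Gitleaks or TruffleHog; the patterns and the example diff are made up for illustration.

```python
import re

# Toy patterns for the kind of secrets a reviewer would otherwise flag by eye.
# Real scanners ship hundreds of tuned rules plus entropy checks; these two
# are illustrative only.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
    re.compile(r"(?i)(api[_-]?key|secret)\s*=\s*['\"][^'\"]{16,}['\"]"),
]

def scan_added_lines(diff_text: str) -> list[str]:
    """Return the added lines in a unified diff that match a secret pattern."""
    findings = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            added = line[1:]
            if any(p.search(added) for p in SECRET_PATTERNS):
                findings.append(added.strip())
    return findings

diff = """\
+++ b/config.py
+api_key = "sk-live-0123456789abcdef0123"
+timeout = 30
"""
print(scan_added_lines(diff))  # flags the hardcoded key, ignores the timeout
```

Wiring a check like this (or, realistically, the off-the-shelf scanner) into a required CI status means the finding blocks the merge before a human ever looks at the PR.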

The second is expanding the reviewer pool. Most regulated orgs route compliance-sensitive review through a small specialist team because nobody else has been trained. Train senior engineers in your compliance framework (DORA, HIPAA, PCI, whatever applies). It is a 2-week investment per person and it doubles or triples your effective review capacity. The specialist team becomes the appeals court, not the first line. This is the single highest-impact move I have seen.

The third is risk-segmenting the codebase. Not every line of code needs the same review intensity. A change to the marketing site CTA does not need the same scrutiny as a change to the AML decisioning service. CODEOWNERS, branch protection rules, and a tagged service inventory can route low-risk AI-assisted changes through lightweight review while sensitive paths trigger the full machinery. This requires a one-time taxonomy effort and pays off forever.
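A minimal sketch of what that routing can look like in a GitHub CODEOWNERS file. The paths and team names here are hypothetical; the point is that sensitive paths get a mandatory specialist reviewer while everything else falls through to the default.

```
# Hypothetical CODEOWNERS sketch. Default: low-risk paths get standard review.
*                            @acme/eng-reviewers

# Sensitive paths require the specialist teams,
# enforced via a branch protection rule requiring code-owner review.
/services/aml-decisioning/   @acme/appsec @acme/compliance-eng
/services/payments/          @acme/appsec
/infra/iam/                  @acme/appsec
```

Later entries take precedence over earlier ones, so the specialist teams override the default only on the paths that need them. Review intensity becomes a property of the path rather than a per-PR judgment call.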

I have seen these three moves, deployed together, restore most of the AI productivity gain in regulated orgs. Without them, the gain stays in the queue.

A word on metrics, because the wrong ones lie to you

Engineering leaders who measure AI ROI by commit count or PR count in a regulated org are measuring the wrong thing. Those metrics will go up. Production releases will not. The disconnect creates a false sense of progress that lasts about a quarter, until the board asks why feature delivery has not accelerated.

Better metrics, in roughly this order:

Time to production-ready, defined as the time from PR open until merge plus all required reviews, scans, and documentation are complete.

Audit findings per release. If AI-assisted code is creating more rework, this number climbs. Watch it like a hawk.

Rework rate. Specifically, the percentage of merged PRs that get reverted, modified, or rolled back within 30 days. AI-generated code often passes review and fails in production for subtle reasons. This metric catches it.

Reviewer cycle time and reviewer load. If your AppSec team is at 90% utilization, you have a problem regardless of how fast the rest of the team moves.
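To make the first and third of these concrete, here is a minimal sketch of computing time-to-production-ready and 30-day rework rate from PR records. The record fields and numbers are hypothetical stand-ins for whatever your Git hosting and change-ticket APIs actually expose.

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records; in practice these would come from your Git
# hosting API joined against the change-ticket system.
prs = [
    {"opened": datetime(2026, 3, 2), "prod_ready": datetime(2026, 3, 9),  "reworked_within_30d": False},
    {"opened": datetime(2026, 3, 3), "prod_ready": datetime(2026, 3, 20), "reworked_within_30d": True},
    {"opened": datetime(2026, 3, 5), "prod_ready": datetime(2026, 3, 11), "reworked_within_30d": False},
]

# Time to production-ready: PR open until merge *plus* all required
# reviews, scans, and documentation are complete (not merge time alone).
days_to_ready = [(p["prod_ready"] - p["opened"]).days for p in prs]
median_days = median(days_to_ready)

# Rework rate: share of merged PRs reverted, modified, or rolled back
# within 30 days.
rework_rate = sum(p["reworked_within_30d"] for p in prs) / len(prs)

print(f"median time to production-ready: {median_days} days")
print(f"30-day rework rate: {rework_rate:.0%}")
```

The median (rather than the mean) keeps one pathological PR from masking the typical experience; track the p90 as well if your queues are long-tailed.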

I am genuinely allergic to the "just measure DORA metrics" answer here, because DORA-the-metrics-framework (the Accelerate book, not the EU regulation; confusing, I know) was written for SaaS environments where the constraint is deployment frequency. In regulated environments, deployment frequency is constrained by external factors AI does not touch.

What this means for the rollout

If you are a CTO or VP Eng in a regulated industry rolling out an AI coding tool this year, here is the order of operations I would recommend, based on what has worked.

Before you turn on the tool widely, audit your downstream review and validation capacity. Map the queue lengths today. Project them at 2x commit volume. Identify which queues will saturate first. (Hint: it is almost always AppSec.)
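The "project them at 2x" step is where a little queueing arithmetic pays off, because wait times are nonlinear in utilization. A minimal M/M/1 sketch with made-up numbers (the AppSec team treated as one pooled server, demand in reviews per day):

```python
def mm1_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time a review waits in queue (M/M/1: Wq = rho / (mu - lambda)).
    Returns infinity once arrivals meet or exceed service capacity."""
    if arrival_rate >= service_rate:
        return float("inf")
    rho = arrival_rate / service_rate  # utilization
    return rho / (service_rate - arrival_rate)

service_rate = 20.0  # reviews/day the AppSec team can complete (assumed)
today = 9.0          # current review arrivals/day (assumed)

for label, lam in [("today", today), ("at 2x commit volume", 2 * today)]:
    print(f"{label}: utilization {lam / service_rate:.0%}, "
          f"mean queue wait {mm1_wait(lam, service_rate):.2f} days")
```

The point is not the model's precision (review service times are not exponential, and a team is not one server) but the shape: doubling arrivals against this near-capacity queue multiplies the mean wait by roughly an order of magnitude, and at or past 100% utilization the queue grows without bound. That is the saturation you are trying to find before the rollout finds it for you.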

Then make the three operational changes above, in parallel with the tool rollout. Not after. The temptation to ship the tool first and figure out the bottleneck later is strong, and it is wrong, because once your engineers have the velocity gain in their hands they do not want to slow down for a process redesign.

Finally, instrument the right metrics from week one. The wrong metrics will make you look good for a quarter and bad for a year. The right ones will let you make the case to the board that AI is paying off, with real release-cadence and audit-quality numbers behind it.

The AI productivity story in regulated industries is not the same story as in unregulated SaaS. It is a real story, and the numbers can be good. They just require the operating model to move with the tool, not behind it.

If you need help mapping the bottleneck migration in your specific environment, this is exactly what we do. The control patterns differ between banking, insurance, and pharma, but the diagnostic framework is the same. Our companion guide on AI coding agents and SOC 2 walks through the rollout controls, and the audit trail architecture piece covers the logging side. Get in touch and we can run a 2-week assessment.


Daniella Mitchell leads compliance engineering at The BrightByte. She has helped engineering organizations in fintech, insurance, and pharma redesign their SDLC operating models around AI coding tools.
