# 58% of PRs in our largest monorepo merge without human review

**Published:** April 6, 2026 | **Authors:** John Phamous, Kacee Taylor, Eric Dodds | **Category:** Engineering

---

One of our oldest and largest Next.js apps is a monorepo that contains multiple critical properties: the Vercel marketing site, our docs, the sign-up flow, dashboards, and internal tooling. The repo sees over 400 pull requests per week on average. Until recently, every one of them required human approval before merging.

Today, an agent reviews and merges 58% of those pull requests without a human reviewer, and average merge time has dropped 62%, from 29 hours to 10.9 hours.

Merging agent-generated code [can be dangerous](https://vercel.com/blog/agent-responsibly). This is a real example of how you can use agents themselves to deploy to production safely.

## The problem: review bottlenecks

Critical design updates and A/B tests need to go live as quickly as possible, but they weren't making it to production as fast as we wanted. After analyzing PRs on the repo, we learned that average time from ready-for-review to merge was 29 hours. That's more than a full working day spent waiting.

Digging deeper, we discovered that over half of PRs were approved (by humans) with zero comments. 18% were rubber-stamped in under 5 minutes.

So we asked ourselves: If most reviews aren't catching anything, what are they actually protecting against?

Pull requests can easily conflate two distinct activities:

- **Alignment** is agreeing on what to build and how: the architecture, structure, and design decisions
- **Verification** is confirming that what was built works correctly

Most changes in a mature codebase like this one only need verification, and AI can handle verification well. Requiring a human to approve every CSS tweak and docs update doesn't make the codebase safer; it makes engineers slower and delays basic updates unnecessarily.

Ironically, AI is making the problem worse: more PRs flow into the bottleneck as agents generate code. But the answer isn't asking engineers to review harder and faster. It's building systems that can distinguish between changes that need human judgment and those that don't.

Here's how we built our auto-merge workflow and what we learned along the way.

## Start with a risk framework

The key insight was plain in the initial analysis: not all PRs carry equal risk. A documentation fix and an authentication change have fundamentally different blast radii. We needed a way to classify that risk automatically.

From the beginning, we collaborated with Kacee Taylor, our Head of Governance, Risk, and Compliance. She provided critical guidance in building the framework, tracking its performance, and maintaining compliance.

We built an LLM-based PR classifier using Gemini that evaluates every PR based on its diff, title, and description. The classifier assigns one of two labels:

- **HIGH risk** includes changes to authentication, payments, data integrity, security, and infrastructure. These always require human review.
- **LOW risk** includes UI changes, styling, tests, documentation, refactors, and feature flags that are turned off. These are candidates for auto-approval.

The classifier returns structured JSON:

```json
{
  "evidenceQuotes": ["+ color: var(--ds-gray-600)"],
  "rationale": "CSS-only theme changes",
  "changes": ["`dashboard-theme.css`: updated color custom properties"],
  "riskLevel": "LOW"
}
```

The schema puts `evidenceQuotes` first and `riskLevel` last. This forces the model to extract verbatim diff snippets and reason about them before it can classify. If it can’t find evidence of risk in the actual code, it defaults to LOW. The decision is grounded in the diff, not the PR title.
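
As an illustration, the returned object can be checked against that shape before anything acts on it. The field names come from the JSON above, but this validator is our sketch, not Vercel's production code:

```typescript
// Shape of the classifier's structured output (field names from the example above).
interface RiskAssessment {
  evidenceQuotes: string[];
  rationale: string;
  changes: string[];
  riskLevel: "LOW" | "HIGH";
}

// Returns the parsed assessment, or null if the model's output doesn't match
// the expected schema. A null result should route the PR to standard human review.
function parseAssessment(raw: string): RiskAssessment | null {
  try {
    const obj = JSON.parse(raw);
    const stringArray = (v: unknown): v is string[] =>
      Array.isArray(v) && v.every((s) => typeof s === "string");
    if (
      stringArray(obj.evidenceQuotes) &&
      typeof obj.rationale === "string" &&
      stringArray(obj.changes) &&
      (obj.riskLevel === "LOW" || obj.riskLevel === "HIGH")
    ) {
      return obj as RiskAssessment;
    }
    return null;
  } catch {
    return null;
  }
}
```

Any output that fails validation never auto-approves anything; it just falls back to ordinary review.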

The classifier is also tuned to prefer false HIGHs over false LOWs. A false HIGH costs one unnecessary review. A false LOW lets risky code ship unreviewed. It flags 93% of data integrity PRs and 92% of security PRs as HIGH risk. On the other end of the spectrum, 0.2% of styling PRs and 0.4% of docs PRs get flagged HIGH.

These categories aren't fixed. Every risk assessment includes an "Incorrect?" link that logs the response to Datadog and routes a notification to Slack. When an engineer flags a misclassification, we review it, and if the classifier was wrong, we add the PR to our evals.

Two hard rules bypass the LLM: PRs with 100+ changed files are always HIGH, and `CODEOWNERS`-protected paths always require human review.
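
A minimal sketch of those hard rules, assuming a list of changed file paths and treating `CODEOWNERS` matching as a simple prefix check (real `CODEOWNERS` patterns are richer):

```typescript
// Hard rules evaluated before the LLM ever runs. If either trips, the PR is
// HIGH risk regardless of what the classifier would say.
function alwaysNeedsHumanReview(
  changedFiles: string[],
  codeownersPaths: string[],
): boolean {
  // Rule 1: very large diffs are always HIGH.
  if (changedFiles.length >= 100) return true;
  // Rule 2: any file under a CODEOWNERS-protected path requires human review.
  // Prefix matching here is a simplification of real CODEOWNERS semantics.
  return changedFiles.some((file) =>
    codeownersPaths.some((prefix) => file.startsWith(prefix)),
  );
}
```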

All LLM calls route through [Vercel AI Gateway](https://vercel.com/ai-gateway) for caching, rate limiting, and observability. The cost is ~$0.054 per assessment, or about $51/week.

This approach puts into practice what we recently described as executable guardrails. Instead of a wiki page listing what counts as risky, we [encoded that judgment into the pipeline](https://vercel.com/blog/agent-responsibly) itself.

## Testing, validation, and rollout

We rolled this out in three phases, each designed to build confidence before increasing the level of merge autonomy. Before starting the test, we defined kill switches. The experiment would end if:

- The revert rate exceeded 3x our baseline of 1.7% (a 5.1% threshold)
- The rollback rate exceeded 3x baseline (7.2/week threshold)
- Team sentiment turned negative
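
The two quantitative switches above reduce to a single check against weekly metrics. This is our sketch of that check (sentiment was tracked qualitatively, so it isn't modeled here):

```typescript
// Kill-switch thresholds from the experiment design: 3x each baseline.
const REVERT_RATE_THRESHOLD = 0.051;      // 3 x the 1.7% revert-rate baseline
const ROLLBACKS_PER_WEEK_THRESHOLD = 7.2; // 3 x the rollback baseline

// Returns true if either quantitative kill switch has tripped.
function shouldKillExperiment(
  revertRate: number,       // fraction of merged PRs later reverted
  rollbacksPerWeek: number, // deployment rollbacks per week
): boolean {
  return (
    revertRate > REVERT_RATE_THRESHOLD ||
    rollbacksPerWeek > ROLLBACKS_PER_WEEK_THRESHOLD
  );
}
```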

Here are the phases of the experiment and what happened in each:

### Phase 1: silent classification

The LLM began labeling every PR as LOW or HIGH risk. The only visible signal was an informational GitHub check that surfaced the classification. Nothing changed operationally for the agent or the team.

We collected data and validated accuracy against our own assessment of risk. It took about three weeks of prompt iteration to meet our accuracy thresholds. At that point, we were ready to validate results with our engineering team.

### Phase 2: visible labels

Vercel Agent started commenting on every PR with the risk classification and rationale. Engineers could see the reasoning, challenge it, and click “Incorrect?” to flag mistakes.

### Phase 3: enforcement

In this phase, LOW-risk PRs were auto-approved by Vercel Agent, satisfying branch protection without a human reviewer. HIGH-risk PRs got a warning comment and still required human approval.

Engineers were still able to request review on any PR they submitted. The change was that review was no longer a blocker for low-risk changes.

The results cleared every safety threshold, and the workflow is now default for the repo.

Vercel's compliance posture was maintained throughout the experimentation and enforcement process, including SOC 2 compliance. We cover details in the compliance section below.

## Results

### Skipping review didn't increase reverts

This was the question that mattered most. If we let low-risk PRs skip review, would more bad code reach production?

671 low-risk PRs skipped review. Zero were reverted. (Wilson 95% CI upper bound: 0.6%, well below our 1% safety threshold.)
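
For reference, the Wilson upper bound quoted here can be computed directly. This helper is our illustration, not part of the pipeline:

```typescript
// Wilson score interval upper bound for a binomial proportion.
// For 0 reverts out of 671 merges at 95% confidence (z = 1.96), this gives
// roughly 0.57%, i.e. the 0.6% figure quoted above.
function wilsonUpperBound(successes: number, n: number, z = 1.96): number {
  const p = successes / n;
  const z2 = z * z;
  const center = p + z2 / (2 * n);
  const margin = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return (center + margin) / (1 + z2 / n);
}
```

The Wilson interval is the right tool here because a naive normal approximation collapses to a zero-width interval when the observed rate is exactly zero.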

The control group (low-risk PRs that still received review) had a statistically indistinguishable revert rate: 2 out of 513 (0.39%). Skipping review made no measurable difference.

Deployment rollbacks decreased from 2.8 per week to 1.9 per week. None of the rollbacks during the experiment were caused by an auto-approved PR. We mapped each rolled-back deployment to the triggering PR via [the Vercel Activity Log](https://vercel.com/docs/cli/activity).

The one incident-causing rollback was a middleware redirect change. The classifier flagged it HIGH. A human reviewed it, approved it, and merged it. The classifier caught the dangerous change, but the human let it through.

### Engineers shipped 62% faster

PRs that skipped review had a median merge time of 0.5 hours, compared to 2.3 hours for reviewed PRs. The gap widens at the tail: at p90, skipped PRs were 58.3 hours faster than reviewed PRs.

Adoption was immediate. The week enforcement turned on, 61% of low-risk PRs skipped review.

Individual human throughput increased 46%. PRs per active author went from 2.6 per week to 3.8 per week.

Peak merge time shifted from 2-4pm to 6-10pm PST. Off-hours merges increased by 7.5 percentage points, weekend merges by 6.3 percentage points. Engineers now merge when the work is done, not when a reviewer is online.

### Human review got better where it matters

Time-to-first-review on HIGH-risk, large-diff PRs dropped from 24.7 hours to 9.0 hours, a 2.7x improvement. When a risky change needs human eyes, it gets them faster.

Reviewer workload also decreased from 13 PRs per week to just over 5. With fewer PRs to process, engineers perform more thorough reviews.

Rubber-stamp rates on HIGH-risk PRs held steady (11.9% vs 12.4% baseline). Security concerns flagged in reviews jumped from 6.3% to 27.2% (n=261 HIGH-risk PRs in stage 2). The review depth on small HIGH-risk diffs improved (Cohen's d = 0.44).

### Do engineers agree with the classifier?

We measured engineer disagreement through behavior, not surveys.

| **Signal** | **Rate** |
| --- | --- |
| CHANGES_REQUESTED on LOW-risk PRs | 0.9% |
| LOW-risk PRs reverted | 0.2% |

43% of low-risk PRs still received voluntary reviews even though they weren’t required. 70% of those had zero comments. Some teams still prefer review for collaboration and knowledge sharing, but the point is making review a choice, not a gate.

## Adversarial hardening

The classifier processes user-controlled input and auto-approves based on the result. This is an adversarial surface.

### Architecture

The system is designed so that even a fully compromised LLM output can’t cause serious harm:

- **Zero tools.** The LLM outputs structured JSON. No code execution, no file access, no API calls.
- **Constrained output, pre-determined actions.** The model can only return two valid risk levels, which map to two possible system actions: APPROVE or WARNING COMMENT. There is no path from LLM output to arbitrary behavior.
- **Fail-open.** If the LLM fails or returns invalid output, the PR falls back to standard human review.
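
That constraint can be sketched as a total mapping from classifier output to action, with anything unexpected failing open. The names here are ours, not the production code:

```typescript
// The only outcomes the system can take, keyed off a single enum value.
type Action = "APPROVE" | "WARNING_COMMENT" | "HUMAN_REVIEW";

// Total mapping from (possibly invalid) classifier output to an action.
// Anything that is not exactly "LOW" or "HIGH" falls back to standard human
// review, so a malformed or manipulated response can never auto-approve.
function actionFor(riskLevel: unknown): Action {
  if (riskLevel === "LOW") return "APPROVE";
  if (riskLevel === "HIGH") return "WARNING_COMMENT";
  return "HUMAN_REVIEW"; // fail-open default
}
```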

### Input hardening

**Invisible Unicode stripping.** We strip tag characters (U+E0000-E007F), variation selectors, and bidi overrides from all LLM inputs. These invisible characters can smuggle instructions into prompts. GitHub preserves them in diffs. This was exploited in the [GlassWorm campaign](https://www.pillar.security/blog/new-vulnerability-in-github-copilot-and-cursor-how-hackers-can-weaponize-code-agents/) (151+ repositories, invisible prompt injection).
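
A sketch of that stripping step, using the Unicode ranges named above (the exact list a production system strips may be broader):

```typescript
// Invisible characters that can smuggle hidden instructions into a prompt:
// Unicode tag characters (U+E0000-E007F), variation selectors (U+FE00-FE0F,
// U+E0100-E01EF), and bidirectional override/isolate controls
// (U+202A-202E, U+2066-2069).
const INVISIBLE_CHARS =
  /[\u{E0000}-\u{E007F}\u{FE00}-\u{FE0F}\u{E0100}-\u{E01EF}\u{202A}-\u{202E}\u{2066}-\u{2069}]/gu;

// Removes invisible characters from text before it reaches the LLM.
function stripInvisible(input: string): string {
  return input.replace(INVISIBLE_CHARS, "");
}
```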

**Output sanitization.** Model-generated text is sanitized before posting to GitHub. Non-HTTPS links and image embeds are stripped.
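
A minimal sketch of that sanitization pass, assuming the model's comment is plain Markdown (a production sanitizer would also handle reference-style links and raw HTML):

```typescript
// Sanitizes model-generated Markdown before posting to GitHub: image embeds
// are removed entirely, and inline links survive only if they point at an
// https:// URL; other links are replaced with their bare text.
function sanitizeMarkdown(text: string): string {
  return text
    .replace(/!\[[^\]]*\]\([^)]*\)/g, "") // drop all image embeds
    .replace(/\[([^\]]*)\]\((?!https:\/\/)[^)]*\)/g, "$1"); // unlink non-HTTPS targets
}
```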

**Author gating.** PRs from untrusted authors (first-time contributors, placeholder accounts) get classification and a posted assessment, but never auto-approval.

**Adversarial eval suite.** Three prompt injection scenarios (Unicode smuggling, XML tag injection, code comment manipulation) run on every deploy with a 100% accuracy gate.

### What we considered and rejected

We explored additional approaches to hardening, but didn't implement them.

| Defense | Why we rejected it |
| --- | --- |
| Salted XML tags | Chatbot threat model, not structured classification. Breaks prompt caching. |
| Sandwich defense | 5.5-85.6% attack success rate in published research; >95% against adaptive attacks |
| XML-escaping inputs | Mangles legitimate code |

### Limitations

Classification is probabilistic. [Carlini et al. (2025)](https://arxiv.org/abs/2504.00038) showed 12 published prompt injection defenses bypassed at 71-100% success rates. No single defense is absolute.

The classifier speeds up easy decisions. It doesn’t replace judgment on hard ones. HIGH-risk changes always require human review. The worst case for a successful attack is one low-risk PR getting auto-approved, and we monitor revert and rollback rates continuously to catch drift.

## Compliance

A common question from engineering leaders is: what about compliance?

Compliance frameworks require a well-defined change management program, not specifically manual peer review. What matters is that changes are authorized, documented, tested, and approved through a consistent, auditable process.

Adding an LLM-based risk classifier strengthens our change management process in three ways:

- **Better documentation.** Every PR now gets a structured risk assessment with reasoning, evidence, and a classification. Under mandatory review, 52% of approvals had no documentation at all. The audit trail went from a single approval click to a full risk rationale per PR.
- **Risk-based routing.** Instead of treating every change the same, the classifier routes human attention to HIGH-risk changes where it matters most. LOW-risk changes flow through a consistent, auditable approval process. Security-sensitive paths still require designated reviewers via `CODEOWNERS`.
- **Continuous monitoring.** Revert rates, rollback rates, and classifier accuracy are tracked weekly. This creates a feedback loop that mandatory review never had: we can measure whether the process is working, not just assume it is.

In practice, the classifier, modeled on our internal risk-based approach, strengthened our change management process and improved auditability rather than weakening either.

## What we learned

**Mandatory review was already theater.** 52% of reviews produced nothing. Auto-approve didn’t remove a functioning safety net. It stopped requiring one that wasn’t working. 671 skipped reviews, same revert rate as reviewed PRs, 62% faster merges.

**The real gain is focus, not just speed.** Reviewers reach HIGH-risk PRs 2.7x faster. The bottleneck for critical PRs was never review capacity. It was review allocation.

**Conservative classification is the right default.** The cost of a false HIGH is one unnecessary review. The cost of a false LOW is risky code shipping unreviewed. Over-flag and let engineers opt out.

**Review became a choice, not a gate.** 43% of low-risk PRs still received voluntary reviews. Some teams prefer review for collaboration and knowledge sharing, which is the ideal state.

## What's next

The skip-review workflow is now permanent on our largest monorepo. We’re rolling it out to more repositories using the same three-phase approach.

As agents generate more code and PR volume increases, the review bottleneck will only get worse for teams that don’t adapt. The answer isn’t reviewing harder. It’s building systems that encode risk judgment into the pipeline.

The scarce resource is human judgment. Spend it where it counts.
