From Noise to Self-Learning: The Evolution of an AI Review Bot

Engineering | April 14, 2026

When development outpaces review, the approach we chose

After adopting AI coding tools, our development speed jumped and review became the bottleneck. To solve it, we built a PR review bot. I'd like to share the journey of improving it.

Why Did We Build a PR Review Bot?

When Development Speed Outpaces Review

After we adopted AI coding tools, our team's development speed picked up in a noticeable way. The problem was that review couldn't keep up.

PRs started landing faster, but the reviewers were still the same people. Naturally, the review queue piled up, and review repeatedly became a deployment bottleneck. Situations like "I wrote the code in 30 minutes, but the review took a day" became the norm.

In fact, before the review bot, our team's median PR merge time was 14.8 hours, and each PR received an average of just 0.9 comments.

The Burden Concentrated on Individual Reviewers

As the number of PRs to review grew, the burden each human reviewer had to shoulder grew too. If I was slow to review, my colleague's deploy got delayed, and testing got delayed. That pressure either lowered the quality of reviews or piled fatigue onto the reviewer.

The Clash With Adopting TBD

Around that same time, our FE team was adopting Trunk-Based Development (TBD). The core of TBD is merging small PRs quickly, and the review bottleneck collided head-on with that flow. You're supposed to merge short-lived branches quickly, but when you get stuck in review, TBD's advantages disappear.

"Can we automate the basic code-quality checks and lighten the load on human reviewers?" — that was the starting point for building the review bot.

The Review Bot Improvement Timeline

Over three months of running the review bot, we went through seven major improvements. (There were a lot of smaller fixes too.) Let me share what problem we hit in each phase and why we made the choices we made.

Phase 1 — Initial Build

Claude-based setup, cost control via triage

Phase 2 — Noise Removal

A flood of noise

Phase 3 — Architecture Shift

From monolith to pipeline, and feeding context to the bot

Phase 4 — The Paradox of Rules

Turn 'no speculation' up too high and the reviews disappear

Phase 5 — 2-Pass Self-Critique

Same model, different perspective — simulating reviewer and author at once

Phase 6 — Completing the Context

Don't just look at the diff — read the design docs

Phase 7 — Autonomous Learning Loop

Automatically converting human reactions into training data

Before vs. After — How Did the Review Culture Change?

Shifting the Human Reviewer's Role

Before the review bot, human reviewers had to look at everything: type errors, import order, naming, business logic, even design intent. After adopting the bot, basic code quality and context-based checks are handled by the bot, and human reviewers can focus on business logic and design decisions.

The Emergence of a Bypass Culture

One interesting shift is that bypass (fast-track approval) became heavily used. For PRs where the bot had already finished its basic checks, human reviewers naturally settled into confirming and approving quickly. Because the bot had already asked "is anything missed?" once, the human reviewer no longer needed to comb through the whole change and could focus on what the bot flagged and on the business logic.

In numbers, the bypass rate climbed to 83% after stabilization. This wasn't just "skipping review" — the data backs it up:

Everyone on the team bypassed evenly — it became a team culture, not one person's habit
Large PRs (500+ lines) still got human review — 32% of Approved PRs were 500+ lines, showing healthy judgment
Bypass hit 92–100% during off-hours — the bot served as a first safety net regardless of time zone
53% of bypassed PRs had actual edits on files the bot flagged — direct evidence that code quality improved even without a human reviewer

Shorter Review Wait Times

It meshed well with the TBD flow too. Bot reviews run the moment a PR is opened, so they're time-zone-independent. Even when human reviewers are offline, you can get basic feedback right away, and by the time the reviewer is back, the bot has already finished the basic checks, so review time dropped significantly.

Overall Results: Three Months in Numbers

We measured three periods — before the bot (P1), early adoption (P2), and stabilization (P3) — using the GitHub API.
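
As a rough illustration of this kind of measurement, the sketch below computes a median merge time from pull request records. It assumes each record carries the `created_at` and `merged_at` timestamps that the GitHub REST API returns for pull requests. The sample records and the `parse_ts` and `median_merge_hours` helpers are hypothetical, not our actual measurement script.

```python
from datetime import datetime
from statistics import median


def parse_ts(ts: str) -> datetime:
    # GitHub timestamps look like "2026-01-05T09:00:00Z"; normalize the
    # trailing "Z" so datetime.fromisoformat accepts it on older Pythons.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def median_merge_hours(prs: list[dict]) -> float:
    """Median hours from PR creation to merge, ignoring unmerged PRs."""
    hours = [
        (parse_ts(pr["merged_at"]) - parse_ts(pr["created_at"])).total_seconds() / 3600
        for pr in prs
        if pr.get("merged_at")
    ]
    return median(hours)


# Hypothetical sample records (not real data from our repository).
sample_prs = [
    {"created_at": "2026-01-05T09:00:00Z", "merged_at": "2026-01-05T10:00:00Z"},
    {"created_at": "2026-01-05T09:00:00Z", "merged_at": "2026-01-05T11:00:00Z"},
    {"created_at": "2026-01-05T09:00:00Z", "merged_at": "2026-01-06T09:00:00Z"},
    {"created_at": "2026-01-05T09:00:00Z", "merged_at": None},  # closed, never merged
]

print(median_merge_hours(sample_prs))  # 2.0
```

Unmerged PRs are skipped here because they have no merge time. Whether to count them some other way is a measurement choice this sketch does not settle.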

Key Metric Changes (P1 → P3)

Metric	No Bot (P1)	Early Adoption (P2)	Stabilized (P3)	Change
Median Merge Time	14.8h	13.0h	1.2h	-92%
Bot Comments / PR	0.1	2.2	1.0	Stabilized after noise removal
Comments / PR	0.9	3.9	2.1	-46% vs P2 (noise removal)
Bypass Merge Rate	32%	53%	83%	Shift in review ownership

The standout is median merge time dropping from 14.8 hours to 1.2 hours, a 92% reduction. The median fell far more sharply than the mean (24.9h → 15.6h, about -37%), which means the typical PR now merges fast while a smaller number of slow PRs still pull the mean up. A fast-merge culture aligned with TBD took hold.
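
The percentages here follow directly from the reported values, and the gap between median and mean is what a long-tailed distribution produces. A minimal sketch: `pct_change` is a hypothetical helper, and the `before` and `after` lists are made-up merge times chosen only to show the effect, not our data.

```python
from statistics import mean, median


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Values reported in this post.
print(round(pct_change(14.8, 1.2)))   # median merge time: -92
print(round(pct_change(24.9, 15.6)))  # mean merge time: -37

# Made-up merge times in hours: most PRs get fast, one stays slow.
before = [10.0, 12.0, 15.0, 18.0, 70.0]
after = [1.0, 1.0, 1.0, 2.0, 70.0]

median_drop = pct_change(median(before), median(after))
mean_drop = pct_change(mean(before), mean(after))

# The single slow PR keeps the mean high, so the mean falls far less
# than the median even though four of five PRs merge much faster.
print(f"median: {median_drop:.0f}%, mean: {mean_drop:.0f}%")
```

In the made-up sample, one 70-hour PR is enough to hold the mean at 15 hours while the median drops to 1 hour, which is the same shape as the numbers reported above.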

Comparison With Industry Benchmarks

According to Faros AI's research (10,000+ developers, 1,255 teams), after AI adoption most teams see a 91% increase in review time. PRs multiply, yet the review bottleneck gets worse — a paradox. Faros AI explains this through Amdahl's Law — "a system moves only as fast as its slowest link."

Metric	Industry Average	Our Team
PR Merge Count Change	+98% (Faros AI)	+93%
PR Cycle Time	-24% (Jellyfish)	-37%
Review Time Change	+91% (Faros AI)	-16%

In an industry where review bottlenecks are getting worse, our team used the review bot to automate that slowest link and break through the paradox.

FE Incident Trend

"We merged fast — but was quality okay?" The FE incident data answers that directly.

Period	Incident / PR
P1 (No Bot)	3.4%
P2 (Early Adoption)	1.8%
P3 (Stabilized)	0.9%

PRs doubled and bypass climbed to 83%, yet the incident-to-PR ratio improved from 3.4% to 0.9%, roughly a 3.8x improvement. In our data, faster reviews did not lead to lower quality.
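
The 3.8x figure is the ratio of the two incident rates. As a quick check using the values from the table (the variable names are illustrative):

```python
p1_rate = 3.4  # incidents per PR in P1, in percent
p3_rate = 0.9  # incidents per PR in P3, in percent

improvement_factor = p1_rate / p3_rate
relative_reduction = (p1_rate - p3_rate) / p1_rate * 100

print(f"{improvement_factor:.1f}x")  # 3.8x
print(f"{relative_reduction:.0f}%")  # 74%
```

Put the other way around, the incident rate per PR fell by about 74%.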

Wrapping Up

The biggest thing I learned across these seven major iterations is that an AI review bot isn't "an AI that reads code well" — it's "a feedback system that fits the team's culture."

If all we'd gained was faster merges, it wouldn't have meant much. No matter how fast you merge, it's pointless if quality drops. But looking at three months of data, PRs doubled and bypass reached 83%, yet the incident-to-PR rate actually fell from 3.4% to 0.9%. Fast and safe — the data proved both are possible.

It wasn't about just editing prompts or yml files. What mattered was defining "what kind of review is valuable to the team," and that definition came out of the team's reaction data. In the end, I think a good review bot is a reflection of a good review culture.

Along the way, we had to think through both perspectives — the PR author's and the reviewer's.

From the PR author's side, nitpicks about code style or repetitive UI comments weren't the kind of review we expected from reviewers. What authors wanted was the team's opinion on things like "is this design decision right?" or "could this change affect other features?"

From the reviewer's side, it's the same. When you're staring at a huge pile of UI code, or at a PR with no clear place to start, and "what on earth am I supposed to look at?" crosses your mind, you end up missing the review that actually matters.

A review where both sides can focus on the parts that genuinely need discussion — that is the review culture we wanted to build with the review bot.