// FOR STAFF ENGINEERS & TECH LEADS

14 PRs. 9 AI-assisted. It's 2:47 PM.

You have to read every line because "looks plausible" is how bugs get merged three approvals deep. We built Guardian because we were tired of writing that Slack message with no data behind it.

// 14:47 — YOUR REVIEW QUEUE

This is every Tuesday afternoon.

review-queue.log

13:02 opened PR #4821 "feat: add retry to webhook handler" +312 −47 author: alice (with claude-code)

13:04 agent comment coderabbit: 18 comments, 14 "consider" nits

13:18 your turn read all 312 lines because "looks plausible" is how bugs ship

13:41 found it off-by-one in the backoff — agent missed it, two juniors already approved

14:12 opened PR #4822 "refactor: extract billing client" +1,104 −890 author: diego (with copilot)

14:47 you're staring at 9 of these still in the queue. your CTO wants "more AI output."

// THE NUMBERS

Not our numbers. Everyone else's.

Before we show you what Guardian sees on your repo, here's what the people who actually run studies have already found. Sample sizes included because you'll ask.

CodeRabbit, AI vs Human Code Report

1.7x / 8x

more issues in AI code than human code. 8x more performance inefficiencies.

Sonar, 2026 State of Code Developer Survey

96% / 48%

of devs don't fully trust AI output. Only 48% verify before committing.

METR, Measuring the Impact of Early-2025 AI on Experienced OSS Developers

−19% / +24%

Devs were 19% slower with AI. They thought they were 24% faster.

Faros.ai, AI Productivity Paradox Report

+98% / +91% / +9%

PR volume. Review time. Bugs per developer. All up. Company-level throughput: unchanged.

// These are industry numbers. Guardian shows you yours, on your repo, next week.

// METHODOLOGY

What Guardian measures. And what it doesn't.

Deterministic Git/PR analytics plus LLM comment classification. The LLM tags the comment type (nit, logic, perf, style) — it does not judge whether the comment is "good." Humans are still the baseline. The methodology is public. You can PR it.

What it measures

  • reviewer effectiveness: did the review comment lead to a code change in this PR?
  • comment classification: nit vs non-nit, via Groq/Gemini, pattern-matched not rated
  • code durability: does the change survive, or does it get reverted in 2 weeks?
  • humans and AI reviewers on the same axis — same PR, same repo, same week

What it does NOT measure

  • "quality" as a vibe — we don't score your code, we score the review loop
  • lines of code, deploys per day, or any vanity metric
  • who is "best" — profiles are not leaderboards; they're for growth, not punishment
  • whether the AI agent's comment was correct — only whether it landed
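The "did it land?" check can be sketched in a few lines. This is a simplified illustration, not Guardian's actual schema — `ReviewComment`, the hunk dicts, and the line-overlap heuristic are all assumptions made for the example:

```python
from dataclasses import dataclass

@dataclass
class ReviewComment:
    author: str   # human login or agent name, e.g. "coderabbit"
    path: str     # file the comment targeted
    line: int     # line the comment targeted

def comment_landed(comment, later_hunks):
    """A comment 'lands' if a push made after it, in the same PR,
    touched the file and line range it pointed at."""
    return any(
        hunk["path"] == comment.path
        and hunk["start"] <= comment.line <= hunk["end"]
        for hunk in later_hunks
    )

def effectiveness(comments, later_hunks):
    """Share of comments that led to a code change in this PR."""
    if not comments:
        return 0.0
    landed = sum(comment_landed(c, later_hunks) for c in comments)
    return landed / len(comments)
```

So two comments where one is followed by an edit to the lines it flagged gives an effectiveness of 0.5. Note what this deliberately doesn't do: judge whether the comment was right, only whether it changed the code.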

// WHAT YOU SEE

Three screens. No mock data.

These are real screenshots from a production tenant. Numbers blurred where they identify anyone.

Guardian agent metrics overview showing multiple AI review agents compared on comment volume and effectiveness

feat(guardian): quantify every agent's signal-to-noise ratio across repos

Every AI review agent on your repos, side by side. Comment volume, non-nit rate, change-adoption rate. You will find out which agent is carrying its weight within the first hour.
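The rollup behind that screen is conceptually simple. A rough sketch, assuming per-comment records with an `agent` name, a `kind` tag (nit/logic/perf/style), and a `led_to_change` flag — these field names are illustrative, not Guardian's real API:

```python
from collections import defaultdict

def agent_rollup(comments):
    """Aggregate per-agent volume, non-nit rate, and adoption rate
    from a flat list of classified review comments."""
    stats = defaultdict(lambda: {"total": 0, "non_nit": 0, "adopted": 0})
    for c in comments:
        s = stats[c["agent"]]
        s["total"] += 1
        s["non_nit"] += c["kind"] != "nit"      # bool counts as 0/1
        s["adopted"] += c["led_to_change"]
    return {
        agent: {
            "volume": s["total"],
            "non_nit_rate": s["non_nit"] / s["total"],
            "adoption_rate": s["adopted"] / s["total"],
        }
        for agent, s in stats.items()
    }
```

An agent that leaves 18 comments per PR with a low adoption rate shows up immediately next to one that leaves 3 that all land — that's the signal-to-noise comparison in one table.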

Guardian agent detail screen showing per-PR comments and code-change adoption rate

fix(review): drill into the agent that's costing your team minutes per PR

Pick the agent. See every PR it commented on this week, its non-nit rate, whether anyone changed the code in response. This is the view you'll screenshot for your CTO.

Guardian reviewers view showing human reviewers ranked alongside AI agents by effectiveness

chore(baseline): you are the reference. everyone else is measured against you

Your human reviewers on the same axis as every agent on your repos. This is the "humans-first" part of the pitch, made literal: the bar is set by the people who already do the job.

// 14:47 — #eng-quality

The message you'll send your team.

Not the one to your VP. The one to the people doing the work.

# eng-quality
Jake 2:47 PM

hey team, i pulled some numbers from the last 30 days across our repos using that guardian thing i mentioned.

the short version: our AI review agents leave ~18 comments per PR. about 2 of those actually lead to code changes. the rest we resolve-in-bulk and move on.

my own effectiveness is ~89% (comments that land as code changes). copilot-chat is at 41%. cursor-agent is at 28%. this isn't me ragging on the agents — it's just what the last 30 days looked like.

not a witch hunt. i just don't want us approving things we didn't read. i'm going to try two changes this sprint:

  1. we stop auto-resolving agent threads. if an agent comment doesn't turn into a code change, it gets a 👎 reaction so guardian can count it.
  2. juniors pair-review with me on AI-generated PRs for two weeks. we're falling into rubber-stamp patterns and i own that.

if you want the raw data, it's at guardian.local/reviewers. i'm on slack if you disagree with any of it.

// THE REST OF THE SUITE

Guardian is the measurement core.

Releezy Loop runs AI coding agents in governed containers. Releezy Reviewer is an autonomous reviewer, project-specific. Both are measured by Guardian against your human baseline — no exceptions, no tuning Guardian to flatter them. If they can't meet your team's bar, you'll see it before we do.

// $ releezy init

See your own numbers.

One repo, one week of data, one Slack message you'll actually believe. If the numbers don't match what you already suspect, we'll close the account ourselves.