    Berner Setterwall · May 28, 2026 · 12 min read

    Self-Driving Pull Requests: How Cogny Sites, Cogny Code, and Visual Proof Removed Us From the Merge Loop

    A few weeks ago I was looking at our git log and noticed something embarrassing. PRs that had passed review were sitting open for weeks before someone (usually me) remembered to type gh pr merge. The gap wasn't review — review happened fast. The gap was the manual step between "this is fine" and "this is shipped."

    Worse, when I did finally merge, follow-up commits I'd pushed seconds later would miss the train. The classic flow looked like:

    "Make a PR for that fix."

    "Done — also, I had one more tweak."

    ...merges the first one before the tweak commit lands...

    "Oh. OK. Let me make a follow-up PR."

    The cost wasn't just my time. It was trust — every time I sat on a PR for a week, the person who opened it learned not to expect their work to ship. Velocity is a culture problem before it's a tooling one, but tooling is what you can fix in an afternoon.

    This is the story of the afternoon. The end state is a self-driving PR-to-prod loop where my only job is to click Approve in the GitHub UI. Everything else — the preview environment, the AI review, the curl-based smoke check, the agent-driven visual proof, the worker-tag releases, the merge itself, the post-deploy verification — is automated.

    The pipeline doesn't lean on a new vendor or a paid CI plan. It's built on four Cogny features we already had, plus a few hundred lines of GitHub Actions glue. And the day we turned it on, it caught six real bugs in 30 minutes, including three in its own implementation and three lurking in production.

    The four Cogny pieces that make this work

    1. Cogny Sites: every PR gets a real preview URL

    This is the foundation. Every push to a PR branch triggers a Cogny Sites build that produces a Cloud Run preview at a stable URL like https://site-90f9bdef-pr-804-gq43eegqva-ew.a.run.app. The preview lifecycle is wired into GitHub's deployment_status events, so the moment the build finishes, downstream workflows get a notification with the live URL in the payload.
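
As a rough sketch of the downstream side, a workflow that receives a deployment_status event only needs the live URL and the commit it belongs to. The TypeScript below assumes the preview URL arrives in the event's environment_url (or target_url) field; the post only says the URL is in the payload, so the exact field is a guess, and the names previewFromEvent and PreviewTarget are illustrative.

```typescript
// Subset of a GitHub `deployment_status` webhook payload.
// Only the fields this sketch reads are typed; real events carry more.
interface DeploymentStatusEvent {
  deployment_status: {
    state: string; // e.g. "success", "failure", "pending"
    environment_url?: string | null;
    target_url?: string | null;
  };
  deployment: {
    sha: string;
    environment: string;
  };
}

// Hypothetical result type: what later steps (smoke check, visual proof) need.
interface PreviewTarget {
  url: string;
  sha: string;
}

// Returns the preview to test, or null when the event is not a finished,
// successful build with a usable https URL.
function previewFromEvent(event: DeploymentStatusEvent): PreviewTarget | null {
  const status = event.deployment_status;
  if (status.state !== "success") return null;

  // Assumption: the live URL is in environment_url, with target_url as fallback.
  const raw = status.environment_url ?? status.target_url;
  if (!raw) return null;

  // Keep only scheme + host so callers can append their own paths.
  const match = /^https:\/\/[^\s\/]+/.exec(raw);
  if (!match) return null;

  return { url: match[0], sha: event.deployment.sha };
}
```

Returning null for anything other than a successful https deployment keeps the smoke check and visual proof steps from firing against a half-built preview.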

    Without this, the whole loop falls apart. You can't run a smoke test against a preview that doesn't exist. You can't ask a computer-use agent to screenshot a pull request that isn't deployed anywhere. The preview is what turns a PR from "some code on a branch" into "something you can actually look at and prod."

    If you've ever waited for a designer to "spin up a staging env" to check a change, you understand the gap Cogny Sites closes. The preview is instant, isolated, and URL-addressable — the three properties that let anything else automate against it.

    2. Cogny Code: the build-autofix loop

    The thing that breaks preview environments most often is a broken build. Someone forgets to commit a file. A dependency is missing. A type error slips through.

    Pre-Cogny-Code, the loop was: build fails → human reads the log → human pushes a fix → wait for next build. Often the human was me and often the failure was something I could have caught with a five-second glance at the log.

    src/lib/build-autofix/trigger.ts is the dispatcher. When a Cogny Sites build fails for a PR, it dispatches our coding agent in refine mode with the build log, the PR branch, and the failure context. The agent reads the log, investigates the underlying issue, fixes it, and pushes a new commit to the PR branch — which fires GitHub's synchronize webhook and re-triggers the build. The cycle caps at three attempts (configurable per workspace) so a genuinely stuck PR doesn't infinite-loop.
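
The decision at the heart of that dispatcher can be sketched as a small pure function. This is not the contents of src/lib/build-autofix/trigger.ts, which the post doesn't show; BuildFailure, AutofixDecision, and decideAutofix are hypothetical names, and the only facts taken from the text are the refine mode, the PR-only scope, and the default cap of three attempts.

```typescript
// Hypothetical description of a failed preview build.
interface BuildFailure {
  prNumber: number | null; // null when the build is not attached to a PR
  branch: string;
  buildLog: string;
}

// Either hand the failure to the coding agent, or stop and report.
type AutofixDecision =
  | { action: "dispatch"; mode: "refine"; attempt: number; branch: string }
  | { action: "give_up"; reason: string };

// The post describes a cap of three attempts, configurable per workspace.
const DEFAULT_MAX_ATTEMPTS = 3;

function decideAutofix(
  failure: BuildFailure,
  previousAttempts: number,
  maxAttempts: number = DEFAULT_MAX_ATTEMPTS,
): AutofixDecision {
  // Autofix only applies to builds that belong to a pull request.
  if (failure.prNumber === null) {
    return { action: "give_up", reason: "build is not attached to a PR" };
  }
  // The cap is what stops a genuinely stuck PR from looping forever.
  if (previousAttempts >= maxAttempts) {
    return { action: "give_up", reason: `tried ${previousAttempts} times and gave up` };
  }
  return {
    action: "dispatch",
    mode: "refine",
    attempt: previousAttempts + 1,
    branch: failure.branch,
  };
}
```

Because each agent commit re-triggers the build, the attempt count has to be stored somewhere that survives across builds (for example, keyed by PR number); where it actually lives is not stated in the post.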

    The killer property here is that the human isn't in the loop. When the autofix works, you discover it by noticing your PR has an extra commit you didn't push, and the build is green. When it doesn't, you get a single "we tried 3 times and gave up" status and the PR goes back to normal review.

    We've found this turns ~80% of broken-build PRs into non-events: they self-heal between when I push and when I next look. The other 20% I actually look at, but with the agent's three attempts on record, the bug is almost always understood by the time I get there.

    3. Cogny Code: deployment risk scoring on PR open

    Build-autofix runs after a build fails. The third piece runs before anything builds: the moment a PR is opened (or pushed to), Cogny fetches the diff and asks Claude Sonnet 4.6 a single structured question — if this PR shipped to production right now, what's the likelihood it breaks something?

    The model returns a numeric risk score, a one-paragraph summary, and a bulleted list of the specific factors that drove the score: large blast-radius migrations, changes to middleware or auth, edits in files the diff is suspiciously quiet about, etc. The score lands as a sticky comment on the PR (marker-tagged so updates replace the previous comment instead of stacking).

    Implementation lives in src/lib/deployment-risk/assess.ts, gated per-workspace by warehouse_context.deployment_risk_assessment_enabled. It's an in-app port of an earlier .github/workflows/deployment-risk.yml, which is why the comment marker and JSON contract are stable — moving the logic into the app didn't break the existing comment thread on older PRs.
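
To make the "JSON contract plus marker" idea concrete, here is a minimal sketch of validating the model's structured reply and planning the sticky-comment update. The field names (score, summary, factors), the 0 to 10 range, and the marker string are assumptions for illustration; the post confirms only that a numeric score, a summary, and a list of factors come back, and that a marker lets updates replace the earlier comment.

```typescript
interface RiskAssessment {
  score: number;
  summary: string;
  factors: string[];
}

// Hypothetical marker; the real one is not shown in the post.
const RISK_MARKER = "<!-- deployment-risk-assessment -->";

// Validate the model's JSON reply; return null rather than trust a bad shape.
function parseRiskAssessment(raw: string): RiskAssessment | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { score, summary, factors } = data as Record<string, unknown>;
  // Assumption: scores live on a 0-10 scale.
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 10) return null;
  if (typeof summary !== "string") return null;
  if (!Array.isArray(factors) || !factors.every((f) => typeof f === "string")) return null;
  return { score, summary, factors: factors as string[] };
}

function renderRiskComment(a: RiskAssessment): string {
  const bullets = a.factors.map((f) => `- ${f}`).join("\n");
  return `${RISK_MARKER}\nDeployment risk: ${a.score}/10\n\n${a.summary}\n\n${bullets}`;
}

interface ExistingComment {
  id: number;
  body: string;
}

type CommentPlan =
  | { op: "update"; id: number; body: string }
  | { op: "create"; body: string };

// Sticky behaviour: edit the comment that carries the marker, else create one.
function planCommentUpsert(existing: ExistingComment[], body: string): CommentPlan {
  const prior = existing.find((c) => c.body.includes(RISK_MARKER));
  return prior ? { op: "update", id: prior.id, body } : { op: "create", body };
}
```

Keeping the marker and the JSON shape fixed is what lets a rewrite of the scoring logic keep editing the same comment on older PRs.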

    The score isn't a gate. It doesn't block merge, doesn't fail a check, doesn't require an override. It's a prior — a number you read before you decide how carefully to review. A score of 2 on a one-line typo fix tells you it's safe to skim. A score of 8 on a "small" auth refactor tells you to read every line and probably ask someone else to look. The number isn't always right, but it's right often enough that "how much time should I spend on this review?" has a useful default answer before I open the diff.

    Pairs nicely with build-autofix: for high-risk PRs that also fail the build we set a lower attempt cap, so the agent intervenes less or not at all, because the risk score is also a signal about whether the agent should be intervening.

    4. Visual Proof: a computer-use agent that screenshots and verifies

    The last piece is the hardest to skip and the easiest to underestimate.

    A passing build doesn't mean the page renders. A passing test suite doesn't mean the dropdown menu is clickable. The class of bug that gets through every automated check is the visual or interaction regression: white-on-white text, a button that's been positioned 4 pixels off-screen, a hover state that no longer fires.

    cogny-visual-proof-reviewer/ is a Cloud Run service running a computer-use agent. You hand it a deployment URL and a list of changed pages (we resolve these from the PR's diff), and it launches Firefox at 1920x1080, navigates each page, takes screenshots, and verifies against a set of success criteria. Results post back to a callback URL.
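
One plausible way to resolve "changed pages" from a diff is to map changed files to routes. The sketch below assumes a Next.js App Router layout under src/app, which the post does not confirm (it only mentions Next.js elsewhere); routeForFile, changedPages, and the VisualProofRequest field names are hypothetical.

```typescript
// Hypothetical request body for the reviewer service.
interface VisualProofRequest {
  deploymentUrl: string;
  pages: string[];
  callbackUrl: string;
}

// Map a changed file to the route it renders, or null if it is not a page.
// Assumption: pages live at src/app/<segments>/page.<ext>.
function routeForFile(path: string): string | null {
  const match = /^src\/app\/(?:(.*)\/)?page\.(?:tsx|ts|jsx|js|mdx)$/.exec(path);
  if (!match) return null;
  const segments = (match[1] ?? "")
    .split("/")
    // Route groups like "(marketing)" do not appear in the URL.
    .filter((s) => s !== "" && !/^\(.*\)$/.test(s));
  // Dynamic segments need concrete params, which this sketch cannot guess.
  if (segments.some((s) => s.charAt(0) === "[")) return null;
  return "/" + segments.join("/");
}

// Unique, sorted list of routes touched by a set of changed files.
function changedPages(files: string[]): string[] {
  const routes: string[] = [];
  for (const file of files) {
    const route = routeForFile(file);
    if (route !== null && routes.indexOf(route) === -1) routes.push(route);
  }
  return routes.sort();
}

function buildVisualProofRequest(
  deploymentUrl: string,
  files: string[],
  callbackUrl: string,
): VisualProofRequest {
  return { deploymentUrl, pages: changedPages(files), callbackUrl };
}
```

A mapping like this misses pages affected indirectly (a shared component, a layout), so a real resolver would likely need a fallback list of always-checked pages.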

    It's slow (30 seconds to two minutes per page) and not cheap (a few cents to a few tens of cents per PR), but it's the only layer that catches the bugs nothing else does. We treat it as an override-able gate: if it flags something the agent misread, the PR author or a maintainer comments /override visual-proof reason: <why> on the PR and the status flips to green. Reason is mandatory; the override lands in the merge commit body as an audit trail.
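
The override comment is simple enough to parse with one regular expression. The following is a sketch of that parsing plus the permission check, written against the /override visual-proof reason: <why> format quoted above; the function names and the audit-line wording are assumptions, not the actual implementation.

```typescript
interface Override {
  gate: string;
  reason: string;
}

// Parse "/override <gate> reason: <why>" from a PR comment body.
// Returns null if no override line is present or the reason is empty.
function parseOverride(commentBody: string): Override | null {
  for (const line of commentBody.split(/\r?\n/)) {
    const match = /^\/override\s+([a-z0-9-]+)\s+reason:\s*(.*)$/i.exec(line.trim());
    if (!match) continue;
    const reason = match[2].trim();
    if (reason.length === 0) return null; // the reason is mandatory
    return { gate: match[1].toLowerCase(), reason };
  }
  return null;
}

// Only the PR author or a maintainer may override.
function mayOverride(commenter: string, prAuthor: string, maintainers: string[]): boolean {
  return commenter === prAuthor || maintainers.indexOf(commenter) !== -1;
}

// Hypothetical audit line appended to the merge commit body.
function auditLine(o: Override, commenter: string): string {
  return `Override(${o.gate}) by ${commenter}: ${o.reason}`;
}
```

For example, parseOverride("/override visual-proof reason: agent misread the cookie banner") yields the gate "visual-proof" with that reason, while a bare "/override visual-proof" is ignored.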

    This is the unlock for what gets called "agent-based testing." You can't fully trust a computer-use agent to decide whether your code ships — that would be reckless. But you can trust it as a suggesting authority whose judgment you override with one comment and one sentence of context. That makes it net-positive: the false positives cost a comment, the true positives save a rollback.

    The orchestration: AI sanity check, curl smoke, merge-on-green

    Cogny Sites, Cogny Code, and Visual Proof are the three big building blocks, but the loop also needs a few hundred lines of GitHub Actions glue to tie them together:

    • AI sanity check — every PR's diff goes through Claude Sonnet 4.6, which checks for a small list of well-known failure modes: secrets in the diff, new public.* tables without RLS, k8s env vars with literal value: next to siblings using valueFrom: (we got burned by exactly this in worker-v2026.05.27-4), destructive migrations without comments, and route/redirect changes that touch case-sensitive URLs. About 5 cents and 60 seconds per PR. Sticky comment + commit status.

    • Curl smoke — every URL whose status code and case matter (/SKILL.md and friends, manifests, infra files, top funnel pages) gets a single-line curl against the PR preview before merge, and against cogny.com after deploy. The script is 50 lines of Bash. We caught the /SKILL.md 308-loop incident from a few weeks ago retroactively — one line of curl against the preview would have blocked the merge.

    • merge-on-green — GitHub Free doesn't include branch protection or the merge queue for private repositories, so we built our own. A workflow fires on either pull_request_review[approved] or workflow_run[completed] for our two required check workflows, validates that the PR has an APPROVED review with no CHANGES_REQUESTED and that both required checks are green on the head SHA, and calls gh pr merge --squash --delete-branch.

    • Auto-tagger — touching src/workers/**, services/creative-probe/**, code-agent/**, or remotion-ad-renderer/** previously required a manual git tag worker-vYYYY.MM.DD-N && git push after merge to roll the corresponding Cloud Build pipeline. That step was routinely forgotten, leaving workers running stale code. Now a post-merge workflow inspects the diff and pushes the right tags automatically.

    • Deploy serialization — two simultaneous frontend deploys used to race the same Kubernetes Deployment and one would lose with a kubectl timeout. Our Cloud Build trigger panel showed several red-! failures per day from this. We inserted a single kubectl rollout status wait-step before the gke-deploy apply in cloudbuild.yaml: Docker builds still run in parallel, only the rollout serializes. The second-in-line build pays 30-60 seconds extra at the deploy step.
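
Two of the glue pieces above reduce to small pure functions: the merge-on-green decision and the next worker tag name. The sketch below is written in TypeScript for illustration (the real glue is GitHub Actions YAML and shell); the check names in REQUIRED_CHECKS are hypothetical, since the post doesn't name the two required workflows, and reviews are assumed to arrive oldest first.

```typescript
type ReviewState = "APPROVED" | "CHANGES_REQUESTED" | "COMMENTED" | "DISMISSED";

interface Review {
  reviewer: string;
  state: ReviewState;
}

interface CheckResult {
  name: string;
  headSha: string;
  conclusion: string | null; // null while the check is still running
}

// Hypothetical names for the two required check workflows.
const REQUIRED_CHECKS = ["ai-sanity-check", "curl-smoke"];

// Merge only when someone has approved, nobody's latest verdict is
// CHANGES_REQUESTED, and every required check is green on the head SHA.
function canMerge(
  headSha: string,
  reviews: Review[],
  checks: CheckResult[],
  required: string[] = REQUIRED_CHECKS,
): boolean {
  // Keep each reviewer's most recent verdict; plain comments don't change it.
  const latest = new Map<string, ReviewState>();
  for (const r of reviews) {
    if (r.state !== "COMMENTED") latest.set(r.reviewer, r.state);
  }
  const verdicts = Array.from(latest.values());
  const approved = verdicts.some((v) => v === "APPROVED");
  const blocked = verdicts.some((v) => v === "CHANGES_REQUESTED");
  const green = required.every((name) =>
    checks.some((c) => c.name === name && c.headSha === headSha && c.conclusion === "success"),
  );
  return approved && !blocked && green;
}

// Next tag in the worker-vYYYY.MM.DD-N scheme for a given date stamp.
function nextReleaseTag(existingTags: string[], dateStamp: string, prefix: string = "worker-v"): string {
  const base = `${prefix}${dateStamp}-`;
  let max = 0;
  for (const tag of existingTags) {
    if (tag.slice(0, base.length) !== base) continue;
    const n = Number(tag.slice(base.length));
    if (Number.isInteger(n) && n > max) max = n;
  }
  return `${base}${max + 1}`;
}
```

Matching checks on the head SHA is the detail that keeps a late follow-up commit from riding in on an earlier commit's green result.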

    What it caught the day we turned it on

    We built this as a single PR (#804). Within 30 minutes of pushing the first commit, the loop had caught six real bugs:

    1. A YAML block-scalar parsing failure in our own auto-release-tags.yml (a literal blank line inside a bash multi-line string broke the on-trigger filter and the workflow fired on a feature branch).
    2. A missing actions/checkout@v4 in enable-auto-merge.yml (gh pr merge calls git under the hood; without a checkout it exits with fatal: not a git repository).
    3. An object-typed env value in visual-proof.yml (${{ github.event.deployment.payload }} is a JSON object, not a scalar — GitHub's parser rejected the workflow at runtime with "A mapping was not expected").
    4. allow_auto_merge: false on the repo, plus branch protection and rulesets both 403'd by the free tier — meaning gh pr merge --auto wouldn't have worked at all. We swapped in our own merge-on-green.yml gate to compensate.
    5. /Skills.md, /SKILLS.md, /AUTH.md, /Auth.md, /LLMS.txt, /Llms.txt all 404 on production right now. The middleware we shipped in PR #775 claimed to handle every casing variant, but the underlying Next.js matcher doesn't actually behave the way the comment claimed. We discovered this the first time we pointed the smoke check at cogny.com. None of these URLs are linked from anywhere in our app, so the bug had been silently hurting external agent fetches for over a week.
    6. A bootstrap chicken-and-egg: merge-on-green.yml couldn't merge the PR that introduced it, because pull_request_review triggers read the workflow file from the default branch. We merged PR #804 manually, one last time, and every PR since has flowed through the gate.
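
The casing bug in item 5 is the kind of thing a smoke check can enumerate mechanically. The real script is about 50 lines of Bash around curl; the TypeScript below only sketches the comparison step, with hypothetical names, and assumes each casing variant is expected to return the same status as the canonical path (200 here, though a redirect to the canonical URL would be an equally reasonable contract).

```typescript
interface SmokeExpectation {
  path: string;
  expectStatus: number;
}

// Lowercase, Capitalized, and UPPERCASE forms of a single-segment path such
// as "/llms.txt". The extension is kept exactly as given.
function casingVariants(path: string): string[] {
  const dot = path.lastIndexOf(".");
  const stem = dot > 0 ? path.slice(1, dot) : path.slice(1);
  const ext = dot > 0 ? path.slice(dot) : "";
  const lower = stem.toLowerCase();
  const forms = [lower, lower.charAt(0).toUpperCase() + lower.slice(1), stem.toUpperCase()];
  const variants: string[] = [];
  for (const form of forms) {
    const candidate = `/${form}${ext}`;
    if (variants.indexOf(candidate) === -1) variants.push(candidate);
  }
  return variants;
}

// Assumption: every variant should answer with the same status code.
function expectationsFor(paths: string[], expectStatus: number = 200): SmokeExpectation[] {
  const out: SmokeExpectation[] = [];
  for (const p of paths) {
    for (const v of casingVariants(p)) out.push({ path: v, expectStatus });
  }
  return out;
}

// Compare observed status codes (path -> status) against expectations and
// return one human-readable line per mismatch. An empty list means green.
function smokeFailures(expectations: SmokeExpectation[], observed: Map<string, number>): string[] {
  const failures: string[] = [];
  for (const e of expectations) {
    const got = observed.get(e.path);
    if (got === undefined) failures.push(`${e.path}: no response recorded`);
    else if (got !== e.expectStatus) failures.push(`${e.path}: expected ${e.expectStatus}, got ${got}`);
  }
  return failures;
}
```

Comparing exact status codes, rather than treating any 2xx or 3xx as fine, is what would also flag a redirect loop like the 308 incident mentioned earlier.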

    Three of these were bugs in our own implementation that we'd never have noticed without running the loop end-to-end. Three were live production issues the loop surfaced on the first invocation.

    If you're building a similar pipeline, the lesson is: the act of dogfooding it surfaces more bugs than any amount of unit testing. We didn't run a single yamllint locally — we ran git push and let the live CI tell us what was broken, then fixed forward.

    What's still manual (and what's next)

    • Approval. This is intentional. The whole point of the loop is that one human still says "yes, ship this." We just don't want any other manual step, because every manual step is one more place PRs can sit.
    • The smoke URL list. When you ship a new externally-linked route or a manifest, you have to add it to scripts/smoke-urls.txt. The AI sanity check warns if you add a route file without updating the list, but the contract is human-maintained.
    • Visual proof callback wiring. We ship the workflow today but the callback handler (a Supabase Edge Function that writes results to a queryable table the GitHub Action polls) is Phase 2. Until then, visual-proof is informational, not blocking.
    • A Cogny "PRs waiting on you" surface. Today the loop runs on GitHub. A natural Phase 2 is to surface the same state — AI summary inline, one-click approve — inside Cogny itself, so the same person who approved a marketing report and approved a budget shift can approve a code PR from the same dashboard. The GitHub-native loop has to work first; this is polish on top.

    The takeaway

    If your PRs are sitting open for weeks, the problem usually isn't review. It's the gap between "approved" and "merged." That gap is fixable in an afternoon if you have the four primitives we used: a real preview environment per PR, an agent that can fix the build, a model that scores the deployment risk before review even starts, and a way to look at the page before it ships.

    We had all four sitting in the codebase for other reasons — Cogny Sites for our hosting product, Cogny Code (build-autofix and deployment-risk) for our ticket workflow, Visual Proof for our screenshot verification. The afternoon was just wiring them together with the right GitHub Actions glue and trusting the loop enough to ship the wiring through itself.

    The result: my only step in shipping any PR is one click. The follow-up commits don't miss the train anymore, because the PR opens with the gate already armed. The deploys don't race themselves. The workers don't fall behind on releases. Case-sensitive URLs like the ones that silently 404'd on prod for over a week now show up in CI within seconds.

    Velocity isn't a vendor you buy. It's a loop you build.