flue: add generic code review alongside style-guide review#31562
Open
mvvmm wants to merge 30 commits into
Open
flue: add generic code review alongside style-guide review#31562mvvmm wants to merge 30 commits into
mvvmm wants to merge 30 commits into
Conversation
Adds a bonk-style generic engineering review that runs in the same
orchestrator run as the existing style-guide review and posts findings
in the same bot comment, now split into "### Code Review" and
"### Style Guide Review" sections.
- New code-review skill: reviews all changed files (including code
examples in MDX) for bugs, error handling, security, dead code, and
maintainability. No style/prose checks; defers CI-caught issues.
- New in-process fan-out (lib/code-review-inproc.ts) mirroring the
style-guide one: per-file sessions, concurrency 5, cap 20 files.
- Code findings add a `critical` severity above warning/suggestion and
use the CR- id namespace; reconcile skill preserves it.
- Code-review agent gets full-file context via GitHub-API-backed tools
(read_repo_file defaults to the PR head SHA, search_repo).
- The repo root AGENTS.md is fetched at the head SHA and injected as
agent instructions every run, since the Worker has no repo checkout
to discover it from.
- Orchestrator runs both reviews concurrently (one failing degrades to
empty rather than aborting), reconciles per stream, persists
{ code, style } findings to R2 (legacy bare-array tolerated), and
renders one unified comment.
Cadence is unchanged: this rides the existing code-review-orchestrator
triggers (auto on PRs, /review, /full-review).
Contributor
|
This pull request requires reviews from CODEOWNERS as it changes files that match the following patterns:
|
Contributor
|
Preview URL: https://df8c587c.preview.developers.cloudflare.com |
Fixes surfaced by the bot reviewing its own PR (#31562): - getRepoFileContent: decode base64 as UTF-8 via TextDecoder (atob alone produced mojibake for non-ASCII content such as em dashes in AGENTS.md); URL-encode path segments; return null only on 404 and throw on other non-2xx so "absent" is distinct from "failed to load". - Fetch the root AGENTS.md from the PR base ref instead of the head SHA. The content is injected into agent instructions, so reading it from the head let a PR poison AGENTS.md to inject instructions into the reviewer. - Render: a degraded (failed) review section now shows a failure note instead of "No issues found", and the status line notes the partial failure; carried-forward findings still render. Orchestrator passes the per-stream failure flags into renderComment. - Render: match acknowledged findings by stable id, not path:line:rule, so two findings sharing those fields are not both dropped. - github-repo-tools.ts: update stale file header (now shared by the Dependabot and code-review agents).
Running the code-review and style-guide fan-outs concurrently kept ~10 model sessions live in a single Durable Object isolate and over-ran its 128 MB limit, producing repeated "isolate exceeded its memory limit and was reset" errors (and downstream event-stream append failures). - Run the two fan-outs sequentially instead of via Promise.all, so one fan-out's sessions are reclaimed before the next starts (roughly halves peak heap). Per-stream failure isolation is preserved. - Lower code-review concurrency from 5 to 3, since its sessions are heavier (full-file reads + injected AGENTS.md) than style-guide ones. Trade-off: review wall-clock is now the sum of the two fan-outs rather than the max — acceptable for a background review bot.
Move the code-review and style-guide fan-outs out of the orchestrator DO into their own discovered workflows, so each runs in its own Durable Object isolate (own 128 MB budget). This removes the shared-isolate memory contention that caused "isolate exceeded its memory limit and was reset" errors and restores parallel execution. - New workflows code-review-specialist + style-guide-specialist (FlueCodeReviewSpecialistWorkflow / FlueStyleGuideSpecialistWorkflow). Each self-fetches its own diff once (no per-file fetch, no R2 diff staging), stages it into its own workspace, runs the existing in-process fan-out (reused verbatim), and returns findings as its run result. - Orchestrator now decides diff mode only (no file fetch), admits both specialists via admitWorkflow and polls them concurrently via pollRun (the spam-filter pattern). A timed-out/errored specialist degrades to an empty section with its failure flag set; both failing -> failure comment. - Specialists self-heal incremental -> full diff if the base SHA is gone. - Code-review concurrency restored to 5 (own isolate now). - New shared lib/review-specialist.ts: payload contract + helpers. - writeDiffToWorkspace accepts a minimal DiffPullRequest shape. - wrangler v5 migration adds the two new SQLite DO classes. Supersedes the sequential-fan-out workaround; the fan-outs no longer run in the orchestrator DO at all.
The code-review specialist timed out on large PRs. AI Gateway showed all
model calls succeeding (0 failures) but slow (p90 ~47s, p99 ~103s), so the
cause was DO memory pressure plus total wall-clock, not model errors:
- A 13-file run at concurrency 5 hard-OOM'd the specialist isolate
("SQL query failed: ... memory limit") and never reached run_end.
- Even without OOM, 13 multi-turn file sessions exceeded the orchestrator's
10-minute poll deadline.
Fixes:
- CODE_REVIEW_MAX_FILES 20 -> 10. Each code-review file is a multi-turn
agent session, so this bounds both peak isolate memory and wall-clock.
Covers virtually all docs PRs; larger PRs review their 10 largest files.
- Orchestrator specialist poll timeout 10m -> 20m, since a legitimate large
review is many slow model calls and the specialist runs durably in its
own DO.
- Concurrency stays at 3 (memory).
Delete each per-file agent session as soon as it finishes so harness memory is bounded to ~concurrency live sessions instead of growing with the changed-file count. With the leak fixed, restore the code-review cap to 20 files and concurrency to 5, and drop the dead reserveTokens hint.
Pre-stage added lines + file content into skill args (code-review-inproc.ts, code-review-specialist.ts, SKILL.md): trusted TypeScript now parses the diff and fetches the full file at head SHA before each session opens, eliminating the mandatory setup turns (read manifest → read patch → parse lines → read_repo_file). Cross-file tools remain available for optional lookups. Removes diff workspace staging from the code-review path entirely. Add external cron watchdog (review-watchdog.ts, review-watchdog-decide.ts, run-state.ts): a scheduled handler fires every 2 minutes, detects reviews stuck past a deadline via inflight R2 markers, and re-drives them. Workflows are non-resumable after DO interruptions; the watchdog is the only reliable recovery path. Inflight markers are written at review start and cleared on completion (code-review-state.ts, code-review-orchestrator.ts). Per-file hard timeout now aborts the CallHandle before deleting the session (code-review-inproc.ts, style-guide-inproc.ts): session.delete() rejects while an operation is active, so abort must settle the operation first. Fixes zombie session leak on timeout. Env-overridable concurrency and file timeout (lib/env.ts): both specialists read CODE_REVIEW_CONCURRENCY, CODE_REVIEW_FILE_TIMEOUT_MS, STYLE_GUIDE_CONCURRENCY, STYLE_GUIDE_FILE_TIMEOUT_MS so local dev can run lower values without touching prod defaults. Staged-diff cleanup in style-guide-specialist.ts: removeWorkspacePath in try/finally so the run-scoped SQLite rows are freed after each review. Added removeWorkspacePath to connectors/cloudflare-shell.ts. Wrangler: cron trigger every 2 min, DOCS_FLUE_BASE_URL var for watchdog. Tooling: flue:reset:local script, NODE_OPTIONS on flue:dev, fixed clear-r2-pr-data.ts to cover both state paths and inflight/ prefix.
The scheduled watchdog was never functional in prod: Flue drops non-HTTP handlers from app.ts (they require cloudflare.ts), so the cron fired every 2 minutes and logged "Handler does not export a scheduled() function" 30 times per hour without ever invoking runReviewWatchdog(). Reviews stopped getting stuck independently — the skill rewrite + per-file session.delete() + concurrency limits reduced orchestrator interruptions to near-zero. The watchdog's run-state detection and re-admission path were also untested end-to-end, so enabling it carried real risk. Removes: review-watchdog.ts, review-watchdog-decide.ts, run-state.ts, all inflight marker API from code-review-state.ts, all marker call sites from code-review-orchestrator.ts, triggers.crons + vars.DOCS_FLUE_BASE_URL from wrangler.jsonc. Reverts app.ts to canonical export default app. Keeps: markAutoReviewCompleted, renderFailureComment, bypassReviewLimit (all used independently). bin/clear-r2-pr-data.ts inflight/ handling kept for one-time cleanup of orphaned prod R2 markers.
An absent triggers key leaves existing cron triggers intact on deploy; triggers.crons=[] forces wrangler to PUT an empty schedule list, actually deregistering the leftover */2 cron.
1. Extract withConcurrency to lib/inproc-utils.ts; import from both
code-review-inproc.ts and style-guide-inproc.ts instead of duplicating.
2. Fix read_repo_file tool UTF-8 decoding: use TextDecoder instead of bare
atob() so non-ASCII content (em dashes, CJK, smart quotes) is not
mojibake. Also encode path segments with encodeURIComponent, matching
the getRepoFileContent fix already applied to github.ts.
3. Fix parseAddedLines: skip lines starting with backslash ('\ No newline
at end of file') instead of treating them as context lines. Previously
this incremented newLine, producing off-by-one line numbers for every
subsequent added line in files without a trailing newline.
4. Fix stale comments: code-review-diff.ts, style-guide-inproc.ts, and
wrangler.jsonc v4 comment still described the old in-orchestrator-DO
architecture; updated to reflect the specialist-DO split.
…d-mode commands
Routes code review to one of two strategies based on combined diff size:
Fan-out (≤50 KB): per-file sessions, concurrency 5, cap 20 files. Good for
smaller PRs; maximum per-file detail, parallel sessions, per-file degradation.
Holistic (>50 KB): single session over the entire PR diff in one pass, cap 50
files (lib/code-review-holistic.ts). Enables cross-file reasoning, no per-file
setup overhead, faster and more reliable on large PRs where fan-out wedges.
New code-review-holistic skill: rich substance from the per-file skill, reframed
for whole-PR review, model-owned navigation.
Comment heading reflects which ran: 'Holistic Code Review' or
'Fan Out Code Review'. Both paths log inputTokens/totalTokens for calibration.
Two new codeowner slash commands:
/fan-out-review — forces fan-out regardless of diff size (with warning in
the comment that it may be slow/fail on large PRs)
/holistic-review — forces holistic regardless of diff size
forceReviewMode threads through orchestrate → orchestrator → review-specialist
→ code-review-specialist, overriding size-based routing when set.
Also: selectCodeReviewFiles takes an optional maxFiles param (holistic passes
50, fan-out uses the existing 20 default); per-file usage logging (inputTokens,
totalTokens) added to reviewSingleFile; progress logs committed from the
previous session also included here.
…; fix partial run consuming auto-review slot /fan-out-review and /holistic-review now check getPullRequest().user.login for dependabot[bot] before dispatching, matching the /review and /full-review pattern. Dependabot PRs route to /workflows/dependabot-review without forceReviewMode; regular PRs keep the forced mode. markAutoReviewCompleted now only fires when both codeOutcome.ok and styleOutcome.ok are true. Previously a partial run (one specialist degraded) still consumed an auto-review slot, which blocked the retry the comment promised would happen on the next push.
… workflow Remove the 20-minute specialist poll from the code-review orchestrator (root cause of stuck reviews when the orchestrator DO was interrupted mid-wait). Replace with an event-driven finalize pattern: - Orchestrator is now dispatch-only: limit check, placeholder, diffMode, write context.json to R2, admit both specialists fire-and-forget, return. - Each specialist wraps its review in try/catch (degraded result on error), writes its stream result to R2, then races for a conditional-PUT finalize lock (If-None-Match: *). Exactly one specialist wins. - New finalize-review workflow (FlueFinalizeReviewWorkflow, v6 migration): reads context + stream results from R2, head-guards against newer pushes, idempotency-guards against double-finalize, reconciles both streams, persists review JSON, renders and posts the comment, swaps reactions, marks auto-review slot consumed, cleans up the pending namespace. - reviewMode is carried through context.json so finalize always uses the same mode the orchestrator decided on (fixes log-vs-comment split in local dev where DO env views can differ). - dispatchId scoping isolates concurrent same-head dispatches (e.g. auto-review + /full-review arriving simultaneously). - Zero new custom Durable Object bindings beyond the standard Flue workflow class (FlueFinalizeReviewWorkflow in wrangler.jsonc v6).
…lize A specialist DO that is hard-evicted mid-review never reaches the writeStreamResult call at the end, leaving no code.json/style.json for the other specialist to detect and claim the finalize lock — exactly the stuck-review scenario seen on PR #31736. Fix: each specialist now writes { ok: false, degradedResult } to R2 BEFORE starting its review. If the DO is evicted, the placeholder is already there. The placeholder is overwritten with the real result on success. Either way finalize always has two stream results and will run, posting a degraded-section comment rather than staying stuck forever. Also fixes the unused FinalizeContext import lint error.
The placeholder writes added in the previous commit caused a race: both specialists write their degraded placeholders within ~1-2s, the first to write checks the sibling's placeholder (which looks like a valid result), claims the lock, and finalize fires immediately with two ok:false results before either review ran — posting an instant failure comment. Fix: - Add final:boolean to StreamResultPayload (false=placeholder, true=real) - tryClaimFinalize now checks final:true on the sibling — a placeholder (final:false) is not treated as 'done', preventing premature triggering - Orchestrator writes both crash-protection placeholders (final:false) BEFORE admitting specialists so a key exists even if a specialist DO is evicted immediately after admission - Specialists write final:true in their rendezvous tail (removed the redundant per-specialist placeholder writes added previously)
…refactors
A. Robustness (stuck-review fixes):
- Wrap all specialist setup (token, diff fetch) + review in one try/catch;
any setup failure now writes a final:true degraded result and triggers
finalize rather than throwing and leaving the placeholder stuck at final:false.
Extract shared reportSpecialistResult() helper into lib/finalize-rendezvous.ts
to eliminate the duplicated rendezvous tail across both specialists.
- Orchestrator admit failures overwrite the placeholder with final:true degraded;
if both fail, orchestrator itself claims the finalize lock and admits finalize.
Drop redundant type casts on the narrowed admit union.
- finalize-review post-comment failure now returns {finalized:false} and skips
markAutoReviewCompleted and cleanupPending so the next push can retry without
burning an auto-review slot on an undelivered comment.
- Degraded streams (ok:false) skip reconciliation and carry previous findings
forward as active, preventing false resolution of prior issues.
B. Defensive R2 / JSON:
- Wrap obj.json() in try/catch in readContext, readStreamResult, and
tryClaimFinalize; treat corrupted/partial writes as null/not-final.
C. Safety:
- Guard req in orchestrator with an explicit throw instead of req!.
- safeOrigin() helper returns '' on URL parse failure in both specialists.
- parseReviewSpecialistPayload validates pr field shapes and DiffMode union
(incremental requires fromSha+toSha strings); normalizes baseUrl to a
valid absolute http(s) origin via new URL().origin, dropping invalid values.
D. Correctness:
- Holistic reviewedFiles now built from patched files only, matching the diff
actually sent to the model.
- clearTimeout called immediately after await handle resolves in both holistic
and per-file reviewSingleFile, preventing a late timer from mislabeling a
post-resolve error as a timeout.
- Finding ID hash widened from 6 to 12 hex chars (24→48 bit) in both CR-/SG-
namespaces; line number deliberately kept excluded for cross-commit stability.
- withConcurrency clamps limit to Math.max(1, Math.floor(limit)).
E. Dev script (bin/clear-r2-pr-data.ts):
- WHERE clause tightened to 'diffs/pr-%' and 'inflight/pr-%' (was 'diffs/%').
- Switch execSync string interpolation to execFileSync with argument arrays.
- Freshen stale header comment.
F. Refactors:
- Collapse four near-identical /review /full-review /fan-out-review
/holistic-review handlers in orchestrate.ts into one shared helper.
- Merge duplicate cloudflare-shell imports in style-guide-specialist.ts.
G. Docs:
- AGENTS.md: add finalize-review row to workflow table; rewrite request flow
to describe dispatch-only orchestrator, R2 rendezvous, final:bool flag, and
finalize-review steps; add v6 migration; update slash-command list.
…th spiral The FlueCodeReviewSpecialistWorkflow and FlueStyleGuideSpecialistWorkflow DOs accumulated SQLite state from previous crashed runs. Each OOM leaves orphaned session data in the DO's storage. After several crashes the accumulated state is large enough that even the Flue alarm recovery handler OOMs on startup (~900ms, repeating every ~30-60s), creating a death spiral where no future review can complete. Fix: two-step migration to reset the storage: v7 — rename bloated instances to *Old (abandons their storage) v8 — recreate the classes with empty storage under the original names The wrangler bindings still point to the original names (FlueCodeReview- SpecialistWorkflow / FlueStyleGuideSpecialistWorkflow), so no code changes are needed. Run history prior to v7 is lost.
… OOM death spiral" This reverts commit fa5e960.
… rendezvous, style-guide rule additions - Generalise finalize rendezvous from 2 streams to N streams (EXPECTED_STREAMS constant; tryClaimFinalize loops all siblings in parallel; stream type widened to string; expectedStreams threaded through context, payload, and options) - Add conventions specialist (light AI session, conventions-check skill): checks PR title format, Summary content against fetched template, and redirect checklist when docs files are renamed/deleted - Add redirect specialist (pure TypeScript, no model): derives old+new URLs from file paths, skips format-only moves (foo.mdx → foo/index.mdx), splat-aware __redirects matching, degrades gracefully if file unreadable - Redirects reconcile deterministically in finalize (no LLM); conventions force full diff mode; bothFailed gate remains code+style only - Add previous_filename to PullRequestFile for correct rename handling - Add marketing-language and time-sensitive-content suggestions to style-guide core-content reference; add component should-use suggestions to code-blocks - Render: 4 sections (Code → Conventions → Style → Redirects); pr path renders as PR label; no-findings messages use <sub> (smaller, no emoji); top-level status line unchanged - wrangler.jsonc v7: new_sqlite_classes for both new DO classes - Fix idempotency guard to skip in log mode (guard is comment-mode only)
Codeowners can comment /disable-auto-review on a PR to stop push-triggered reviews from running. The flag is stored in R2 at diffs/pr-<n>/auto-review-disabled.json (same pattern as ignore-review-limit). Manual /review and /full-review commands (bypassReviewLimit=true) still work normally. The command is acknowledged with a thumbs-up reaction.
Holistic mode timed out on large PRs (>50 KB diff, single model pass, 15-min hard timeout) producing zero findings. Fan-out is strictly better: bounded per-file sessions, graceful per-file degradation, guaranteed completion on any diff size. - Delete lib/code-review-holistic.ts and .agents/skills/code-review-holistic/ - code-review-specialist.ts: always fan-out; remove routing logic, holistic imports, env reads, forceReviewMode handling - lib/code-review-inproc.ts: remove CODE_REVIEW_HOLISTIC_* constants and re-export; remove reviewMode from both return sites - lib/code-review-results.ts: remove reviewMode field from CodeReviewResult - lib/review-specialist.ts: remove forceReviewMode from payload type + parser - code-review-orchestrator.ts: remove forceReviewMode from payload + specialistBody - orchestrate.ts: remove /fan-out-review and /holistic-review commands - code-review-render.ts: heading always 'Code Review'; remove codeMode from RenderReviewInput; remove /fan-out-review and /holistic-review command rows - finalize-review.ts: remove codeMode from renderComment call - AGENTS.md: remove holistic routing description and command bullets
Contributor
Author
|
/full-review |
Contributor
ReviewReviewing new changes (commit
Code ReviewThis code review is in beta and may not always be helpful — use your judgment. Warnings (22)
Suggestions (5)
ConventionsChecks PR title, description, and redirect checklist. Warnings (1)
Style Guide ReviewNo style-guide issues found. RedirectsNo missing redirect entries found. CommandsOnly codeowners can run commands. Post a comment with the command to trigger it.
|
Replace rigid format rules (title must start with [Product] or type:, description must have a ### Summary section) with lenient semantic checks: 1. Product/area identified — title OR description names the area; no tag required 2. Description explains the work — any human prose passes; only flag truly empty/placeholder descriptions 3. Scope accuracy — only flag material misrepresentation, not missing details Default to no finding in all three. Only flag clear, significant violations.
Contributor
Author
|
/disable-auto-review |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds a generic engineering code review to the Flue bot. Both the code review and existing style-guide review now post to a single bot comment split into
### Code Reviewand### Style Guide Reviewsections.Architecture
Both reviewers run as separate specialist workflows (
code-review-specialist,style-guide-specialist), each in their own Durable Object isolate with its own 128 MB memory budget. The orchestrator admits both and polls them concurrently; one failing degrades to an empty section rather than aborting the whole review.Code review specialist
args.addedLinesandargs.fileContent, eliminating the mandatory setup turns the model previously spent reading manifests and fetching files.read_repo_file(defaults to PR head SHA) andsearch_reporemain available as optional tools for cross-file lookups (callers, import sites, etc.).AGENTS.mdis fetched from the PR base ref and injected as agent instructions so the reviewer has repo conventions without a checkout. (Base ref, not head, to prevent a PR from poisoning the reviewer's instructions.)critical/warning/suggestionseverity hierarchy withCR-IDs.Style guide specialist
src/content/(docs|partials|changelog)/**.mdx, capped at 20 files.codetool.warning/suggestionseverities only.Memory and timeout
session.delete()), bounding peak heap to ~concurrency live sessions rather than growing with the file count.CODE_REVIEW_FILE_TIMEOUT_MS/STYLE_GUIDE_FILE_TIMEOUT_MS): on timeout, theCallHandleis aborted first (settling the operation) and then the session is deleted, preventing zombie session leaks.CODE_REVIEW_CONCURRENCY,STYLE_GUIDE_CONCURRENCY) so local dev can run lower values without touching prod defaults.Other
code,style) independently against its prior findings + human comments; persists{ code, style }to R2 (legacy bare-array shape tolerated)./review,/full-review, and/ignore-review-limitcodeowner commands.Validated
tsc --noEmit,flue build, prettier. Verified end-to-end inlogmode andcommentmode against real PRs.