
Field Notes - Oct 28, '25
Executive Signals
- Agents are the new juniors: spike docs and phased auth unlock safe autonomy
- A/B is the new PR: cross-model reviewers and previews catch pre-merge bugs
- Build where models learned: mainstream stacks yield higher acceptance and fewer rewrites
- Evals over vibes: prompt search beats hand-tuning when metrics are real
- Demos impress, plumbing saves: tracing, limits, and costs prevent production fire drills
CEO
Juniors Own Ops, Seniors Govern Risk
In the agent era, code is abundant; shipping fails on ops. Update ladders so juniors handle deployments, secrets, CI/CD, observability, and stakeholder updates without hand-holding. Reserve senior attention for architecture, governance, and risk. Manage onboarding so lead time to a first production change stays under two weeks, then tighten it to days.
- Require new hires to deploy, wire secrets, and add alerts solo
- Publish a ready-to-ship checklist: environments, limits, logging, rollbacks
- Track lead time to first prod change; intervene if it exceeds 10 days
Product
Evals Over Vibes For Prompts
Human-tuned system prompts are brittle. Given a representative eval suite, automated prompt search reliably finds better prompts than freehand writing does. Treat prompts like code: versioned, tested, and locked. Ship only when the measured delta beats a defined threshold so changes compound instead of churning; a sketch of that gate follows the list below.
- Build a 50–200 case eval set tied to product KPIs
- Run nightly prompt search; freeze winners with checksums
- Ship only when eval pass rate improves by at least five points
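A minimal sketch of the ship gate, assuming your harness already scores each eval case pass/fail. `EvalResult`, the SHA-256 checksum, and the five-point delta mirror the checklist above but are otherwise hypothetical, not any framework's API.
```typescript
// Ship gate for prompt changes: promote a candidate prompt only when its eval
// pass rate beats the frozen baseline by the agreed margin, then freeze it
// with a checksum. Types and thresholds are illustrative.

import { createHash } from "node:crypto";

interface EvalResult {
  caseId: string;
  passed: boolean;
}

// Pass rate in points (0-100) over the eval set.
function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return (100 * results.filter((r) => r.passed).length) / results.length;
}

// "Freeze winners with checksums": record a SHA-256 of the exact prompt text.
function freeze(prompt: string, score: number) {
  const checksum = createHash("sha256").update(prompt).digest("hex");
  return { checksum, score, frozenAt: new Date().toISOString() };
}

// "Ship only when eval pass rate improves by at least five points."
function shouldShip(baseline: EvalResult[], candidate: EvalResult[], minDeltaPoints = 5): boolean {
  return passRate(candidate) - passRate(baseline) >= minDeltaPoints;
}
```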
Engineering
Scaffold Agents With Spike Docs, Then Let Them Run
Agents behave like fast juniors: great with crisp specs, poor with ambiguity. Put an implementation “spike” doc in-repo—objectives, constraints, metrics, file paths, checklists—iterate it with an agent, then authorize phased execution; a phase-gate sketch follows the checklist below. Track hands-off runtime and push toward multi-hour autonomous runs that compress “two sprints” into a day.
- Require a two-page plan with file- and line-level references
- Gate execution to Phase 1; promote only after passing checks
- Track hands-off runtime; push for 90–180 minutes
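One way to make phased authorization checkable is to give the spike plan a small schema a runner can evaluate before promoting the agent to the next phase. This is a sketch under that assumption; `SpikePlan`, `SpikePhase`, and the example checks are illustrative, not any specific tool's format.
```typescript
// Hypothetical in-repo spike plan: objectives, constraints, metrics, and
// phased execution gates. The agent is only authorized up to the first phase
// whose checks do not yet pass.

interface SpikePhase {
  name: string;                 // e.g. "Phase 1: queue skeleton"
  touchedPaths: string[];       // file-level references from the plan
  checks: Array<() => boolean>; // typecheck, unit tests, lint, etc.
}

interface SpikePlan {
  objective: string;
  constraints: string[];
  metrics: string[];
  phases: SpikePhase[];
}

// Index of the first phase whose checks fail, or -1 if everything passes.
function firstBlockedPhase(plan: SpikePlan): number {
  return plan.phases.findIndex((phase) => !phase.checks.every((check) => check()));
}

const plan: SpikePlan = {
  objective: "Replace ad-hoc polling with a job queue",
  constraints: ["no schema changes outside db/migrations", "keep p95 latency flat"],
  metrics: ["hands-off runtime", "tests passing", "p95 latency"],
  phases: [
    { name: "Phase 1: queue skeleton", touchedPaths: ["src/queue/"], checks: [() => true] },
    { name: "Phase 2: migrate callers", touchedPaths: ["src/jobs/"], checks: [() => false] },
  ],
};

// Phase 2's check fails, so only Phase 1 is authorized for now.
console.log("blocked at phase index:", firstBlockedPhase(plan)); // 1
```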
Default To Model-Favored Stacks
Agent performance is distributional, not neutral. Stacks with the most training surface—typed TypeScript/React/Node and mainstream web infra—produce higher-quality outputs and fewer rewrites. If your “standard” fights the grain, agents become the bottleneck.
- Default to TypeScript end-to-end, strict typing, mainstream libraries
- Approve exceptions only with written rationale and mitigations
- Measure agent edit acceptance rate; target over 70% of proposed lines accepted (metric sketch below)
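A back-of-the-envelope version of that metric, assuming you can attribute proposed and surviving lines per agent change; `AgentChange` and the example numbers are hypothetical.
```typescript
// Edit acceptance rate: of the lines agents proposed, how many were merged
// without a human rewrite? The checklist target above is over 70%.

interface AgentChange {
  linesProposed: number;
  linesAccepted: number; // lines that survived review unchanged
}

function acceptanceRate(changes: AgentChange[]): number {
  const proposed = changes.reduce((sum, c) => sum + c.linesProposed, 0);
  const accepted = changes.reduce((sum, c) => sum + c.linesAccepted, 0);
  return proposed === 0 ? 0 : (100 * accepted) / proposed;
}

const thisWeek: AgentChange[] = [
  { linesProposed: 420, linesAccepted: 350 },
  { linesProposed: 130, linesAccepted: 80 },
];

console.log(`acceptance: ${acceptanceRate(thisWeek).toFixed(1)}%`); // 78.2%
```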
Cross-Model PR Reviews With Per-PR Previews
Models are strong at critiquing each other. Trigger reviews from two or three different agents per PR and spin up ephemeral previews with dedicated URLs and seeded databases. Validate primary user paths with automated browser flows before merge; track disagreement and pre-merge bug find rates. A gating sketch follows the checklist below.
- Mandate at least two model reviewers; block on unresolved deltas
- Give each PR a preview URL with seeded data and e2e smoke tests
- Track disagreement and pre-merge bug find rates; improve monthly
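Concretely, assuming CI hands each reviewer a unified diff through an adapter per model, the blocking rule can look like the sketch below. The `Reviewer` type and the definition of a "delta" (a blocker one reviewer raises that another does not match at the same location) are assumptions, not an existing service's contract.
```typescript
// Cross-model review gate: collect findings from two or more reviewers and
// block the merge while any blocker raised by one reviewer is not matched by
// the others at the same location. The reviewer call itself is an adapter to
// whichever model APIs you actually use.

interface Finding {
  file: string;
  line: number;
  severity: "blocker" | "nit";
  note: string;
}

type Reviewer = (diff: string) => Promise<Finding[]>;

async function reviewGate(
  diff: string,
  reviewers: Reviewer[],
): Promise<{ pass: boolean; deltas: Finding[] }> {
  if (reviewers.length < 2) throw new Error("mandate at least two model reviewers");
  const findings = await Promise.all(reviewers.map((review) => review(diff)));

  const blockers = findings.map((f) => f.filter((x) => x.severity === "blocker"));
  const at = (f: Finding) => `${f.file}:${f.line}`;

  // A delta: a blocker from reviewer i that some other reviewer did not flag
  // at the same file and line. This feeds the disagreement rate you track.
  const deltas = blockers.flatMap((own, i) =>
    own.filter((f) => blockers.some((other, j) => j !== i && !other.some((o) => at(o) === at(f)))),
  );

  return { pass: deltas.length === 0, deltas };
}
```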
Refactors: Parallelize And Select The Best
Refactors are ideal for agents: high toil, clear invariants, testable outcomes. Launch multiple parallel runs against the same target and keep the winner; the rest are cheap exploration. Bound regressions with coverage gates and type-strict builds, and kill flaky test generation quickly.
- Kick off 5–10 parallel runs; select by tests, bundle size, and performance (selection sketch after this list)
- Enforce ≥80% coverage and type-strict builds
- Timebox runs to 12–24 hours; fail fast on flaky test generation
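Assuming each run reports tests, coverage, strictness, bundle size, and latency, the pick can be a hard-gate filter plus a simple score; the field names and weights are placeholders to tune.
```typescript
// Winner selection for parallel refactor runs: hard gates first (tests green,
// coverage >= 80%, type-strict build), then rank survivors on bundle size and
// latency. Field names and weights are placeholders.

interface RefactorRun {
  id: string;
  testsPassed: boolean;
  coverage: number;     // percent of lines covered
  typeStrict: boolean;  // built with strict typing enabled
  bundleKb: number;
  p95LatencyMs: number;
}

function pickWinner(runs: RefactorRun[]): RefactorRun | null {
  const eligible = runs.filter((r) => r.testsPassed && r.coverage >= 80 && r.typeStrict);
  if (eligible.length === 0) return null; // every run failed a gate: keep the existing code

  // Lower is better for both; the 0.5 weight on latency is arbitrary.
  const score = (r: RefactorRun) => r.bundleKb + 0.5 * r.p95LatencyMs;
  return eligible.reduce((best, r) => (score(r) < score(best) ? r : best));
}
```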
LLMOps First, Not After The Demo
Chat UIs are trivial; production AI is routing, feedback loops, evals, rate limits, tracing, and cost control. Teams stall when plumbing lags. Stand up the ops layer before user traffic and block GA until offline evals meet thresholds; a budget-and-rate-limit sketch follows the checklist below.
- Ship tracing, feedback capture, and token/cost budgets before users
- Add rate limits, red-team prompts, and abuse handling to v1
- Block launch until offline evals clear predefined thresholds
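In miniature, the plumbing can be as small as the sketch below: a per-user rate limit and daily token budget checked before any model call, plus a trace record per request. The in-memory maps and the limit values are stand-ins for whatever store and configuration you actually run.
```typescript
// Pre-launch plumbing in miniature: rate limit, token budget, and tracing.
// In-memory maps and the limit values stand in for your real store and config.

interface Trace {
  userId: string;
  route: string;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  at: string;
}

const traces: Trace[] = [];
const spentTokens = new Map<string, number>();
const recentCalls = new Map<string, number[]>();

const DAILY_TOKEN_BUDGET = 200_000; // illustrative
const CALLS_PER_MINUTE = 20;        // illustrative

// Gate a request before it reaches any model.
function allow(userId: string, estimatedTokens: number): boolean {
  const now = Date.now();
  const window = (recentCalls.get(userId) ?? []).filter((t) => now - t < 60_000);
  if (window.length >= CALLS_PER_MINUTE) return false; // rate limit
  if ((spentTokens.get(userId) ?? 0) + estimatedTokens > DAILY_TOKEN_BUDGET) return false; // budget
  recentCalls.set(userId, [...window, now]);
  return true;
}

// Record what actually happened so evals, feedback, and cost reports have data.
function record(trace: Trace) {
  spentTokens.set(trace.userId, (spentTokens.get(trace.userId) ?? 0) + trace.tokensIn + trace.tokensOut);
  traces.push(trace);
}
```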