
Field Notes - Oct 28, '25
Executive Signals
- Agents are the new juniors: spike docs and phased auth unlock safe autonomy
- A/B is the new PR: cross-model reviewers and previews catch pre-merge bugs
- Build where models learned: mainstream stacks yield higher acceptance and fewer rewrites
- Evals over vibes: prompt search beats hand-tuning when metrics are real
- Demos impress, plumbing saves: tracing, limits, and costs prevent production fire drills
CEO
Juniors Own Ops, Seniors Govern Risk
In the agent era, code is abundant; shipping fails on ops. Update ladders so juniors handle deployments, secrets, CI/CD, observability, and stakeholder updates without hand-holding. Reserve senior attention for architecture, governance, and risk. Manage onboarding so lead time to a first production change stays under two weeks, then tighten it to days.
- Require new hires to deploy, wire secrets, and add alerts solo
- Publish a ready-to-ship checklist: environments, limits, logging, rollbacks
- Track lead time to first prod change; intervene if it exceeds 10 days
Product
Evals Over Vibes For Prompts
Human-tuned system prompts are brittle. Given a representative eval suite, automated prompt search reliably finds better prompts than freehand writing does. Treat prompts like code: versioned, tested, and locked. Ship only when the measured delta beats a defined threshold so changes compound instead of churning; a sketch of that gate follows the list below.
- Build a 50–200 case eval set tied to product KPIs
- Run nightly prompt search; freeze winners with checksums
- Ship only when eval pass rate improves by at least five points
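A minimal sketch of the ship gate, assuming your harness already scores each eval case pass/fail. `EvalResult`, the SHA-256 checksum, and the five-point delta mirror the checklist above but are otherwise hypothetical, not any framework's API.
```typescript
// Ship gate for prompt changes: promote a candidate prompt only when its eval
// pass rate beats the frozen baseline by the agreed margin, then freeze it
// with a checksum. Types and thresholds are illustrative.

import { createHash } from "node:crypto";

interface EvalResult {
  caseId: string;
  passed: boolean;
}

// Pass rate in points (0-100) over the eval set.
function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  return (100 * results.filter((r) => r.passed).length) / results.length;
}

// "Freeze winners with checksums": record a SHA-256 of the exact prompt text.
function freeze(prompt: string, score: number) {
  const checksum = createHash("sha256").update(prompt).digest("hex");
  return { checksum, score, frozenAt: new Date().toISOString() };
}

// "Ship only when eval pass rate improves by at least five points."
function shouldShip(baseline: EvalResult[], candidate: EvalResult[], minDeltaPoints = 5): boolean {
  return passRate(candidate) - passRate(baseline) >= minDeltaPoints;
}
```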
Engineering
Scaffold Agents With Spike Docs, Then Let Them Run
Agents behave like fast juniors: great with crisp specs, poor with ambiguity. Put an implementation “spike” doc in-repo—objectives, constraints, metrics, file paths, checklists—iterate it with an agent, then authorize phased execution; a phase-gate sketch follows the checklist below. Track hands-off runtime and push toward multi-hour autonomous runs that compress “two sprints” into a day.
- Require a two-page plan with file- and line-level references
- Gate execution to Phase 1; promote only after passing checks
- Track hands-off runtime; push for 90–180 minutes
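One way to make phased authorization checkable is to give the spike plan a small schema a runner can evaluate before promoting the agent to the next phase. This is a sketch under that assumption; `SpikePlan`, `SpikePhase`, and the example checks are illustrative, not any specific tool's format.
```typescript
// Hypothetical in-repo spike plan: objectives, constraints, metrics, and
// phased execution gates. The agent is only authorized up to the first phase
// whose checks do not yet pass.

interface SpikePhase {
  name: string;                 // e.g. "Phase 1: queue skeleton"
  touchedPaths: string[];       // file-level references from the plan
  checks: Array<() => boolean>; // typecheck, unit tests, lint, etc.
}

interface SpikePlan {
  objective: string;
  constraints: string[];
  metrics: string[];
  phases: SpikePhase[];
}

// Index of the first phase whose checks fail, or -1 if everything passes.
function firstBlockedPhase(plan: SpikePlan): number {
  return plan.phases.findIndex((phase) => !phase.checks.every((check) => check()));
}

const plan: SpikePlan = {
  objective: "Replace ad-hoc polling with a job queue",
  constraints: ["no schema changes outside db/migrations", "keep p95 latency flat"],
  metrics: ["hands-off runtime", "tests passing", "p95 latency"],
  phases: [
    { name: "Phase 1: queue skeleton", touchedPaths: ["src/queue/"], checks: [() => true] },
    { name: "Phase 2: migrate callers", touchedPaths: ["src/jobs/"], checks: [() => false] },
  ],
};

// Phase 2's check fails, so only Phase 1 is authorized for now.
console.log("blocked at phase index:", firstBlockedPhase(plan)); // 1
```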
Default To Model-Favored Stacks
Agent performance is distributional, not neutral. Stacks with the most training surface—typed TypeScript/React/Node and mainstream web infra—produce higher-quality outputs and fewer rewrites. If your “standard” fights the grain, agents become the bottleneck.
- Default to TypeScript end-to-end, strict typing, mainstream libraries
- Approve exceptions only with written rationale and mitigations
- Measure agent edit acceptance rate; target over 70% of proposed lines accepted (metric sketch below)
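A back-of-the-envelope version of that metric, assuming you can attribute proposed and surviving lines per agent change; `AgentChange` and the example numbers are hypothetical.
```typescript
// Edit acceptance rate: of the lines agents proposed, how many were merged
// without a human rewrite? The checklist target above is over 70%.

interface AgentChange {
  linesProposed: number;
  linesAccepted: number; // lines that survived review unchanged
}

function acceptanceRate(changes: AgentChange[]): number {
  const proposed = changes.reduce((sum, c) => sum + c.linesProposed, 0);
  const accepted = changes.reduce((sum, c) => sum + c.linesAccepted, 0);
  return proposed === 0 ? 0 : (100 * accepted) / proposed;
}

const thisWeek: AgentChange[] = [
  { linesProposed: 420, linesAccepted: 350 },
  { linesProposed: 130, linesAccepted: 80 },
];

console.log(`acceptance: ${acceptanceRate(thisWeek).toFixed(1)}%`); // 78.2%
```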
Cross-Model PR Reviews With Per-PR Previews
Models are strong at critiquing each other. Trigger reviews from two or three different agents per PR and spin up ephemeral previews with dedicated URLs and seeded databases. Validate primary user paths with automated browser flows before merge; track disagreement and pre-merge bug find rates. A gating sketch follows the checklist below.
- Mandate at least two model reviewers; block on unresolved deltas
- Give each PR a preview URL with seeded data and e2e smoke tests
- Track disagreement and pre-merge bug find rates; improve monthly
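Concretely, assuming CI hands each reviewer a unified diff through an adapter per model, the blocking rule can look like the sketch below. The `Reviewer` type and the definition of a "delta" (a blocker one reviewer raises that another does not match at the same location) are assumptions, not an existing service's contract.
```typescript
// Cross-model review gate: collect findings from two or more reviewers and
// block the merge while any blocker raised by one reviewer is not matched by
// the others at the same location. The reviewer call itself is an adapter to
// whichever model APIs you actually use.

interface Finding {
  file: string;
  line: number;
  severity: "blocker" | "nit";
  note: string;
}

type Reviewer = (diff: string) => Promise<Finding[]>;

async function reviewGate(
  diff: string,
  reviewers: Reviewer[],
): Promise<{ pass: boolean; deltas: Finding[] }> {
  if (reviewers.length < 2) throw new Error("mandate at least two model reviewers");
  const findings = await Promise.all(reviewers.map((review) => review(diff)));

  const blockers = findings.map((f) => f.filter((x) => x.severity === "blocker"));
  const at = (f: Finding) => `${f.file}:${f.line}`;

  // A delta: a blocker from reviewer i that some other reviewer did not flag
  // at the same file and line. This feeds the disagreement rate you track.
  const deltas = blockers.flatMap((own, i) =>
    own.filter((f) => blockers.some((other, j) => j !== i && !other.some((o) => at(o) === at(f)))),
  );

  return { pass: deltas.length === 0, deltas };
}
```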
Refactors: Parallelize And Select The Best
Refactors are ideal for agents: high toil, clear invariants, testable outcomes. Launch multiple parallel runs against the same target and keep the winner; the rest are cheap exploration. Bound regressions with coverage gates and type-strict builds, and kill flaky test generation quickly.
- Kick off 5–10 parallel runs; select by tests, bundle size, and performance (selection sketch after this list)
- Enforce ≥80% coverage and type-strict builds
- Timebox runs to 12–24 hours; fail fast on flaky test generation
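Assuming each run reports tests, coverage, strictness, bundle size, and latency, the pick can be a hard-gate filter plus a simple score; the field names and weights are placeholders to tune.
```typescript
// Winner selection for parallel refactor runs: hard gates first (tests green,
// coverage >= 80%, type-strict build), then rank survivors on bundle size and
// latency. Field names and weights are placeholders.

interface RefactorRun {
  id: string;
  testsPassed: boolean;
  coverage: number;     // percent of lines covered
  typeStrict: boolean;  // built with strict typing enabled
  bundleKb: number;
  p95LatencyMs: number;
}

function pickWinner(runs: RefactorRun[]): RefactorRun | null {
  const eligible = runs.filter((r) => r.testsPassed && r.coverage >= 80 && r.typeStrict);
  if (eligible.length === 0) return null; // every run failed a gate: keep the existing code

  // Lower is better for both; the 0.5 weight on latency is arbitrary.
  const score = (r: RefactorRun) => r.bundleKb + 0.5 * r.p95LatencyMs;
  return eligible.reduce((best, r) => (score(r) < score(best) ? r : best));
}
```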
LLMOps First, Not After The Demo
Chat UIs are trivial; production AI is routing, feedback loops, evals, rate limits, tracing, and cost control. Teams stall when plumbing lags. Stand up the ops layer before user traffic and block GA until offline evals meet thresholds; a budget-and-rate-limit sketch follows the checklist below.
- Ship tracing, feedback capture, and token/cost budgets before users
- Add rate limits, red-team prompts, and abuse handling to v1
- Block launch until offline evals clear predefined thresholds
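In miniature, the plumbing can be as small as the sketch below: a per-user rate limit and daily token budget checked before any model call, plus a trace record per request. The in-memory maps and the limit values are stand-ins for whatever store and configuration you actually run.
```typescript
// Pre-launch plumbing in miniature: rate limit, token budget, and tracing.
// In-memory maps and the limit values stand in for your real store and config.

interface Trace {
  userId: string;
  route: string;
  tokensIn: number;
  tokensOut: number;
  costUsd: number;
  at: string;
}

const traces: Trace[] = [];
const spentTokens = new Map<string, number>();
const recentCalls = new Map<string, number[]>();

const DAILY_TOKEN_BUDGET = 200_000; // illustrative
const CALLS_PER_MINUTE = 20;        // illustrative

// Gate a request before it reaches any model.
function allow(userId: string, estimatedTokens: number): boolean {
  const now = Date.now();
  const window = (recentCalls.get(userId) ?? []).filter((t) => now - t < 60_000);
  if (window.length >= CALLS_PER_MINUTE) return false; // rate limit
  if ((spentTokens.get(userId) ?? 0) + estimatedTokens > DAILY_TOKEN_BUDGET) return false; // budget
  recentCalls.set(userId, [...window, now]);
  return true;
}

// Record what actually happened so evals, feedback, and cost reports have data.
function record(trace: Trace) {
  spentTokens.set(trace.userId, (spentTokens.get(trace.userId) ?? 0) + trace.tokensIn + trace.tokensOut);
  traces.push(trace);
}
```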