
Field Notes

Field Notes are fast, from-the-trenches observations. Time-bound and may age poorly. Summarized from my real notes by agents. Optimized for utility. Not investment or legal advice.

░░░░░░░▄█▄▄▄█▄
▄▀░░░░▄▌─▄─▄─▐▄░░░░▀▄
█▄▄█░░▀▌─▀─▀─▐▀░░█▄▄█
░▐▌░░░░▀▀███▀▀░░░░▐▌
████░▄█████████▄░████
=======================
Field Note Clanker
=======================
⏺ Agent start
│
├── 1 data source
└── Total 10.4k words
⏺ Spawning 1 Sub-Agent
│
├── GPT-5: Summarize → Web Search Hydrate
├── GPT-5-mini: Score (Originality, Relevance)
└── Return Good Notes
⏺ Field Note Agent
│
├── Sorted to 3 of 7 sections
├── Extracting 5 key signals
└── Posting Approval
⏺ Publishing
┌────────────────────────────────────────┐
│ Warning: Field notes are recursively │
│ summarized by agents. These likely age │
│ poorly. Exercise caution when reading. │
└────────────────────────────────────────┘

Field Notes - Oct 28, '25

Executive Signals

  • Agents are the new juniors: spike docs and phased authorization unlock safe autonomy
  • A/B is the new PR: cross-model reviewers and previews catch pre-merge bugs
  • Build where models learned: mainstream stacks yield higher acceptance and fewer rewrites
  • Evals over vibes: prompt search beats hand-tuning when metrics are real
  • Demos impress, plumbing saves: tracing, rate limits, and cost budgets prevent production fire drills

CEO

Juniors Own Ops, Seniors Govern Risk

In the agent era, code is abundant; shipping fails on ops. Update ladders so juniors handle deployments, secrets, CI/CD, observability, and stakeholder updates without hand-holding. Reserve senior attention for architecture, governance, and risk. Manage onboarding so lead time to a first production change drops under two weeks, then tighten to days.

  • Require new hires to deploy, wire secrets, and add alerts solo
  • Publish a ready-to-ship checklist: environments, limits, logging, rollbacks
  • Track lead time to first prod change; intervene if it exceeds 10 days (metric sketched below)
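
A minimal sketch of that lead-time metric, in TypeScript to match the default stack below; the Hire shape and the 10-day line are illustrative assumptions, not a real HR schema.

  // Lead-time sketch: days from start date to first production change,
  // flagged past the 10-day intervention line. Hire is an assumed shape.
  interface Hire { name: string; startDate: Date; firstProdChange?: Date; }

  const DAY_MS = 24 * 60 * 60 * 1000;
  const INTERVENE_AFTER_DAYS = 10;

  function leadTimeDays(h: Hire, today = new Date()): number {
    const end = h.firstProdChange ?? today; // clock keeps running until they ship
    return Math.ceil((end.getTime() - h.startDate.getTime()) / DAY_MS);
  }

  function needsIntervention(h: Hire): boolean {
    return leadTimeDays(h) > INTERVENE_AFTER_DAYS;
  }

  // A hire twelve days in with no prod change gets flagged.
  const hire: Hire = { name: "new-hire", startDate: new Date(Date.now() - 12 * DAY_MS) };
  console.log(leadTimeDays(hire), needsIntervention(hire)); // 12 true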

Product

Evals Over Vibes For Prompts

Human-tuned system prompts are brittle. With a representative eval suite, models reliably discover better prompts via automated search than freehand writing. Treat prompts like code: versioned, tested, and locked. Ship only when the measured delta beats a defined threshold so changes compound instead of churn.

  • Build a 50–200 case eval set tied to product KPIs
  • Run nightly prompt search; freeze winners with checksums
  • Ship only when eval pass rate improves by at least five points (see the gate sketch after this list)
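
A minimal sketch of the ship gate. EvalCase and the injected runCase harness are assumptions; the five-point threshold and the checksum freeze come straight from the checklist.

  import { createHash } from "node:crypto";

  // Ship gate sketch: a candidate prompt ships only when its eval pass
  // rate beats the frozen incumbent by at least five points.
  interface EvalCase { input: string; expected: string; }

  // runCase is an assumed harness hook: run one case, return pass/fail.
  type RunCase = (prompt: string, c: EvalCase) => Promise<boolean>;

  async function passRate(prompt: string, cases: EvalCase[], run: RunCase): Promise<number> {
    let passed = 0;
    for (const c of cases) if (await run(prompt, c)) passed++;
    return (100 * passed) / cases.length;
  }

  async function shipIfBetter(candidate: string, incumbent: string, cases: EvalCase[], run: RunCase) {
    const cand = await passRate(candidate, cases, run);
    const inc = await passRate(incumbent, cases, run);
    const ship = cand - inc >= 5; // the defined threshold: five points
    // Freeze the winner with a checksum so later drift is detectable.
    const winner = ship ? candidate : incumbent;
    const checksum = createHash("sha256").update(winner).digest("hex");
    return { ship, checksum };
  }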

Engineering

Scaffold Agents With Spike Docs, Then Let Them Run

Agents behave like fast juniors: great with crisp specs, poor with ambiguity. Put an implementation “spike” in-repo (objectives, constraints, metrics, file paths, checklists), iterate it with an agent, then authorize phased execution. Track hands-off runtime and extend it toward multi-hour autonomous runs that compress “two sprints” into a day.

  • Require a two-page plan with file- and line-level references
  • Gate execution to Phase 1; promote only after passing checks (gating sketched below)
  • Track hands-off runtime; push for 90–180 minutes
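
A sketch of the phase gate, under assumed names (SpikePlan, Phase): execution authority is an index the agent cannot advance without green checks.

  // Phase gate sketch: promotion requires every check in the current
  // phase to pass; otherwise the gate stays shut.
  interface Phase { name: string; checks: Array<() => boolean>; }

  class SpikePlan {
    private authorized = 0; // highest phase index the agent may run
    constructor(private phases: Phase[]) {}

    canExecute(phaseIndex: number): boolean {
      return phaseIndex <= this.authorized;
    }

    promote(): boolean {
      const current = this.phases[this.authorized];
      if (!current.checks.every((check) => check())) return false; // gate stays shut
      this.authorized = Math.min(this.authorized + 1, this.phases.length - 1);
      return true;
    }
  }

  const plan = new SpikePlan([
    { name: "Phase 1: schema + migrations", checks: [() => true] }, // stand-in for CI checks
    { name: "Phase 2: API endpoints", checks: [] },
  ]);
  console.log(plan.canExecute(1)); // false: Phase 2 not yet authorized
  console.log(plan.promote(), plan.canExecute(1)); // true true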

Default To Model-Favored Stacks

Agent performance is distributional, not neutral. Stacks with the most training surface—typed TypeScript/React/Node and mainstream web infra—produce higher-quality outputs and fewer rewrites. If your “standard” fights the grain, agents become the bottleneck.

  • Default to TypeScript end-to-end, strict typing, mainstream libraries
  • Approve exceptions only with written rationale and mitigations
  • Measure agent edit rate; target over 70% of lines accepted (see the sketch below)
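
A sketch of the edit-rate metric; the AgentDiff shape is an assumption standing in for whatever your agent tooling actually reports.

  // Edit-rate sketch: share of agent-proposed lines that land unmodified.
  interface AgentDiff { proposed: number; accepted: number; }

  function acceptanceRate(diffs: AgentDiff[]): number {
    const proposed = diffs.reduce((n, d) => n + d.proposed, 0);
    const accepted = diffs.reduce((n, d) => n + d.accepted, 0);
    return proposed === 0 ? 0 : (100 * accepted) / proposed;
  }

  // Below 70%, the stack is probably fighting the models' training distribution.
  const rate = acceptanceRate([
    { proposed: 420, accepted: 350 },
    { proposed: 180, accepted: 95 },
  ]);
  console.log(rate.toFixed(1), rate > 70 ? "ok" : "investigate stack"); // 74.2 ok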

Cross-Model PR Reviews With Per-PR Previews

Models are strong at critiquing each other. Trigger reviews from two or three different agents per PR and spin up ephemeral previews with dedicated URLs and seeded databases. Validate primary user paths with automated browser flows before merge; track disagreement and pre-merge bug find rates.

  • Mandate at least two model reviewers; block on unresolved deltas (gate sketched after this list)
  • Give each PR a preview URL with seeded data and e2e smoke tests
  • Track disagreement and pre-merge bug find rates; improve monthly
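
A sketch of the cross-model gate, assuming each model sits behind a Reviewer adapter that returns findings; an unresolved delta is read here as a finding not raised by every reviewer.

  // Review gate sketch: fan the diff out to every reviewer, then block
  // merge while any finding is not shared by all of them.
  type Reviewer = (diff: string) => Promise<string[]>; // returns findings

  async function reviewGate(diff: string, reviewers: Reviewer[]) {
    const findings = await Promise.all(reviewers.map((review) => review(diff)));
    const all = new Set(findings.flat());
    // A delta is a finding only some reviewers raised; deltas block merge.
    const deltas = [...all].filter((f) => !findings.every((list) => list.includes(f)));
    return { merge: deltas.length === 0, deltas };
  }

  // Two stub reviewers disagreeing on one finding blocks the merge.
  const strict: Reviewer = async () => ["missing input validation"];
  const lenient: Reviewer = async () => [];
  reviewGate("diff --git a/api.ts ...", [strict, lenient]).then(console.log);
  // -> { merge: false, deltas: ["missing input validation"] }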

Refactors: Parallelize And Select The Best

Refactors are ideal for agents: high toil, clear invariants, testable outcomes. Launch multiple parallel runs against the same target and keep the winner; the rest are cheap exploration. Bound regressions with coverage gates and type-strict builds, and kill flaky test generation quickly.

  • Kick off 5–10 parallel runs; select by tests, bundle size, and performance (selection sketched below)
  • Enforce ≥80% coverage and type-strict builds
  • Timebox runs 12–24 hours; fail fast on flaky test generators
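
A sketch of best-of-N selection, assuming a RunResult shape for whatever your harness reports: filter by green tests and the coverage gate, then rank by bundle size with latency as the tiebreaker.

  // Best-of-N sketch: losers are cheap exploration; only the winner merges.
  interface RunResult { id: string; testsPassed: boolean; coverage: number; bundleKb: number; p95Ms: number; }

  function pickWinner(runs: RunResult[]): RunResult | undefined {
    return runs
      .filter((r) => r.testsPassed && r.coverage >= 80) // the >=80% coverage gate
      .sort((a, b) => a.bundleKb - b.bundleKb || a.p95Ms - b.p95Ms)[0];
  }

  const winner = pickWinner([
    { id: "run-1", testsPassed: true, coverage: 83, bundleKb: 412, p95Ms: 180 },
    { id: "run-2", testsPassed: true, coverage: 91, bundleKb: 398, p95Ms: 210 },
    { id: "run-3", testsPassed: false, coverage: 88, bundleKb: 380, p95Ms: 150 },
  ]);
  console.log(winner?.id); // "run-2": smallest bundle among passing runs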

LLMOps First, Not After The Demo

Chat UIs are trivial; production AI is routing, feedback loops, evals, rate limits, tracing, and cost control. Teams stall when plumbing lags. Stand up the ops layer before user traffic and block GA until offline evals meet thresholds.

  • Ship tracing, feedback capture, and token/cost budgets before users (see the guard sketch below)
  • Add rate limits, red-team prompts, and abuse handling to v1
  • Block launch until offline evals clear predefined thresholds
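
A sketch of the token/cost budget guard, with console logging standing in for a real tracer; CallFn and the log shape are assumptions.

  // Budget guard sketch: wrap every model call so it is traced and
  // blocked once the daily token budget is spent.
  type CallFn = (prompt: string) => Promise<{ text: string; tokens: number }>;

  function withBudget(call: CallFn, dailyTokenBudget: number): CallFn {
    let spent = 0;
    return async (prompt) => {
      if (spent >= dailyTokenBudget) {
        throw new Error("token budget exhausted; request blocked");
      }
      const start = Date.now();
      const res = await call(prompt);
      spent += res.tokens;
      // Trace every call: latency, tokens used, budget remaining.
      console.log(JSON.stringify({
        latencyMs: Date.now() - start,
        tokens: res.tokens,
        remaining: dailyTokenBudget - spent,
      }));
      return res;
    };
  }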