
Field Notes - Nov 19, '25
Executive Signals
- Output is the new UI: taste systems and evals outrun chrome tweaks
- Agents are the new APIs: audited handshakes with human approvals beat brittle connectors
- Leaderboard gains, budget pains: model swaps only when users feel it
- Security drag, delivery speed: risk-tiered controls and measured days beat gold-plating
- Milestones as mirrors, not fireworks: demo plus retro, then dated next steps
CEO
Treat Milestones as Demos, Not Launches
Convert milestone days into a 60-minute show-and-tell plus honest retro. Prove the path with a small, real batch and a human-in-the-loop to capture timings and edge cases without pretending it’s ready to scale the same day. Leave with explicit acceptance gates and a dated plan for phase two.
- Push 5–10 low-risk submissions end-to-end; record success rate, latency, handoffs
- Agree acceptance gates for phase two (e.g., >95% first-pass success)
- Replace the day’s standup with the demo/retro; exit with owners and dates
Ride the Model Curve, Not the Hype
Design for “free” gains without replatforming, but hold swaps to a business bar. Put a gateway in front of apps, keep business logic outside frameworks, and review models quarterly. Change only when task success, CSAT, or time-to-complete improves while cost per task stays within target, with pinned fallbacks and failover.
- Run hidden A/Bs; promote only if KPI lift clears a cost ceiling
- Maintain a pinned fallback; auto-failover on SLO breaches
- Budget a 90‑day model review cadence behind a gateway
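A minimal sketch of the gateway idea, assuming a stubbed call_model() and illustrative model names: route to the hidden A/B candidate, fail over to the pinned model, and promote only when task success improves while cost per task stays within the ceiling.

```python
# Sketch only: model names, KPI fields, and call_model() are assumptions,
# not a specific vendor API.
from dataclasses import dataclass

@dataclass
class ModelStats:
    task_success: float   # fraction of tasks completed correctly
    cost_per_task: float  # dollars per completed task

PINNED = "model-a-pinned"      # known-good fallback
CANDIDATE = "model-b-preview"  # hidden A/B candidate

def call_model(model: str, prompt: str) -> str:
    """Placeholder for the real provider call behind the gateway."""
    raise NotImplementedError

def complete(prompt: str) -> str:
    """Route to the candidate, but fail over to the pinned model on errors or SLO breaches."""
    try:
        return call_model(CANDIDATE, prompt)
    except Exception:
        return call_model(PINNED, prompt)  # auto-failover

def should_promote(cand: ModelStats, pinned: ModelStats,
                   min_lift: float = 0.02, cost_ceiling: float = 0.05) -> bool:
    """Promote only if task success improves and cost per task stays in budget."""
    return (cand.task_success - pinned.task_success) >= min_lift \
        and cand.cost_per_task <= cost_ceiling
```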
Small Lockbox Squads Build Durable Platforms
A tight, cross-functional squad moves faster than a matrixed org and leaves reusable assets behind. Their deliverable isn't a one-off feature; it's platform primitives the rest of the company can build on via versioned APIs.
- Charter a 4–6 person squad for a 6‑week pilot
- Deliver: AI gateway, eval service, prompt/config repo, safety filters, retrieval
- Integrate only through versioned APIs to avoid dependency drag
Customer Success
Design Human Handoff and Graceful Degradation on Day One
Assume the agent will get stuck. Define when and how conversations move to people, and make failure states preserve trust. Target low handoff rates, fast response for takeovers, and full context transfer so humans resolve quickly and the agent learns from observation.
- Expose a takeover control; summarize context and intent for humans
- Keep MTTA/MTTR under two minutes for handoffs; track to target
- Aim for <20% handoff at launch, <10% by day 60
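One way to make full context transfer concrete, with assumed field names: a single handoff packet carrying the summary, intent, and transcript, plus timestamps that drive MTTA tracking.

```python
# Sketch of a handoff packet a human agent receives on takeover.
# Field names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Handoff:
    conversation_id: str
    summary: str                  # agent-written recap of the conversation so far
    detected_intent: str          # e.g. "refund_request"
    blocking_reason: str          # why the agent got stuck
    transcript: list[str]         # full message history for the human
    requested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    acknowledged_at: datetime | None = None  # set on human takeover

    def mtta_seconds(self) -> float | None:
        """Time from handoff request to human acknowledgement; None until acknowledged."""
        if self.acknowledged_at is None:
            return None
        return (self.acknowledged_at - self.requested_at).total_seconds()
```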
Product
Output Is the UI
In AI products, response quality is the interface. Treat taste as a system: house-voice rubrics, golden datasets, and continuous side‑by‑side evaluations. Ship the eval harness before features so PMs and engineers raise quality weekly, not quarterly, and block releases on regression.
- Build 100–300 canonical conversations; require ≥90% pass to release
- Run weekly double‑blind evals; track win rate vs last version
- Capture approve/edit annotations in-line; recycle into train/eval sets
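A sketch of the release gate, assuming a placeholder judge() for whatever grader the team actually uses (house-voice rubric, model judge, exact match): run the candidate over the golden set and block any release below the pass threshold.

```python
# Sketch only: judge() and the golden-set record shape are assumptions.
def judge(expected: str, actual: str) -> bool:
    """Placeholder grader; swap in the team's rubric or model judge."""
    raise NotImplementedError

def release_gate(golden_set: list[dict], generate, pass_threshold: float = 0.90) -> bool:
    """Run the candidate over canonical conversations; block release below threshold."""
    passed = sum(
        judge(case["expected"], generate(case["prompt"])) for case in golden_set
    )
    pass_rate = passed / len(golden_set)
    print(f"golden-set pass rate: {pass_rate:.1%}")
    return pass_rate >= pass_threshold
```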
Bounded Agents, Risk-Aware Launches
Marry LLM reasoning with deterministic workflows and validators. Choose the first surface by blast radius: start internal when errors are costly and irreversible, external when outcomes are reversible and guardrailed. Maintain kill switches and memory policies to keep actions safe and recoverable.
- Gate GA on ≥90% task success and <5% unsafe output on a golden set
- Whitelist state‑changing tools with explicit confirmations and validators
- Add a per-surface kill switch for instant rollback
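A sketch of the whitelist idea with assumed tool and surface names: state-changing tools require explicit confirmation, must pass a validator, and honor a per-surface kill switch.

```python
# Sketch only: tool names, the validator rule, and surfaces are illustrative.
KILL_SWITCH = {"billing_surface": False}  # flip to True for instant rollback

STATE_CHANGING_TOOLS = {
    # tool name -> validator over its arguments
    "issue_refund": lambda args: 0 < args.get("amount", 0) <= 100,
}

def run_tool(surface: str, tool: str, args: dict, human_confirmed: bool):
    if KILL_SWITCH.get(surface):
        raise RuntimeError(f"{surface} is disabled by kill switch")
    validator = STATE_CHANGING_TOOLS.get(tool)
    if validator is not None:            # state-changing: gate it
        if not human_confirmed:
            raise PermissionError(f"{tool} requires explicit confirmation")
        if not validator(args):
            raise ValueError(f"{tool} rejected by validator: {args}")
    # ... dispatch to the actual tool implementation here
```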
Two-Track Build: Intelligence and Interface
Stand up a test studio (tools, retrieval, validators, traces) while a separate track builds the minimal UI customers touch. Freeze the v1 UI early and swap brains behind a stable contract. Record every session to debug fast and keep a single compatibility spec for inputs, outputs, and errors.
- Freeze v1 UI after week two; iterate brains behind a stable interface
- Record traces, tool calls, and retrieval snippets for every session
- Define one compatibility spec any agent implementation must satisfy
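One possible shape for the compatibility spec, with illustrative names: typed request and response types, a single error type, and a Protocol every brain implementation must satisfy.

```python
# Sketch only: field and type names are assumptions.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class AgentRequest:
    session_id: str
    user_message: str

@dataclass
class AgentResponse:
    reply: str
    tool_calls: list[dict]         # recorded for tracing and debugging
    retrieval_snippets: list[str]  # what the agent actually saw

class AgentError(Exception):
    """Every implementation maps its failures onto this one error type."""

class AgentBrain(Protocol):
    def respond(self, request: AgentRequest) -> AgentResponse: ...
```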
Agents Outgrow Rigid Integrations
Expect agent-to-agent workflows across vendors to evolve faster than bespoke connectors or rigid abstractions. Use signed handshakes, audit trails, and human approvals at boundaries so you can change tools without rewriting everything.
- Pilot one cross‑vendor handshake with audit and rollback
- Require manager approval before any cross‑entity commit
- Log every tool/action pair for traceability
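A sketch of the boundary check, assuming an HMAC-signed payload and a stand-in secret: verify the handshake signature, require manager approval before any cross-entity commit, and log the tool/action pair.

```python
# Sketch only: the signing scheme, secret handling, and log sink are assumptions.
import hashlib
import hmac
import json
import logging

logger = logging.getLogger("agent_handshake")
SHARED_SECRET = b"rotate-me"  # placeholder; use a real secrets manager

def sign(payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def commit_cross_entity(payload: dict, signature: str, manager_approved: bool):
    if not hmac.compare_digest(sign(payload), signature):
        raise PermissionError("handshake signature mismatch")
    if not manager_approved:
        raise PermissionError("cross-entity commit requires manager approval")
    logger.info("tool=%s action=%s", payload.get("tool"), payload.get("action"))
    # ... perform the commit, keeping a rollback record
```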
Week‑One Wins That Pay Back Immediately
Before moonshots, fix legibility, continuity, and cost. Make inputs machine-readable, preserve session context, stream progress, and route work to the cheapest capable model. Aim for measurable cost-per-conversation reduction within a quarter.
- Add robust extraction for names/emails/phones/IDs; confirm back to users
- Persist session context; stream partial results instead of spinners
- Route by task to the cheapest capable model; cache hot prompts
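A sketch of cheapest-capable routing with assumed task categories and model names: route by task, fall back to a mid-tier default, and cache hot prompts.

```python
# Sketch only: the task taxonomy, model names, and call_model() are assumptions.
from functools import lru_cache

ROUTES = {
    "classify": "small-fast-model",
    "extract":  "small-fast-model",
    "draft":    "mid-model",
    "reason":   "frontier-model",
}

def call_model(model: str, prompt: str) -> str:
    """Placeholder for the actual gateway call."""
    raise NotImplementedError

@lru_cache(maxsize=1024)          # cache hot (task, prompt) pairs
def complete(task: str, prompt: str) -> str:
    model = ROUTES.get(task, "mid-model")
    return call_model(model, prompt)
```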
Engineering
Status Taxonomy That Makes Automation Safe
Separate “how it’s processed” from “where it is.” Use an is_automated flag for provenance and a clear status path—New → In Progress → With Compliance—with Automation Failed for hard stops. Time‑in‑status becomes your early warning signal and prevents double work.
- Lock records when In Progress; unlock only on success or failure
- Page if time‑in‑status breaches a threshold; auto‑retry with backoff
- Build a daily exceptions report from Automation Failed for triage
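A sketch of the taxonomy with the statuses named above and an in-memory record: is_automated carries provenance, every transition resets the time-in-status clock, and the record locks while In Progress.

```python
# Sketch only: storage is assumed in-memory; field names follow the note above.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    WITH_COMPLIANCE = "With Compliance"
    AUTOMATION_FAILED = "Automation Failed"

@dataclass
class Record:
    record_id: str
    is_automated: bool = False           # provenance, not position in the workflow
    status: Status = Status.NEW
    status_since: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    locked: bool = False

    def transition(self, new_status: Status) -> None:
        self.status = new_status
        self.status_since = datetime.now(timezone.utc)
        # lock while automation works the record; unlock on success or failure
        self.locked = new_status == Status.IN_PROGRESS

    def minutes_in_status(self) -> float:
        """Time-in-status, the early-warning signal for paging and retries."""
        return (datetime.now(timezone.utc) - self.status_since).total_seconds() / 60
```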
Right‑Size Security Without Stalling Delivery
Adopt tiered controls to avoid gold-plating. Ship P0–P1 for MVP (no keys on laptops, audited secrets, basic CI checks), and stage P2 hardening after validation. Measure security drag in days so trade‑offs are explicit and adjustable.
- Pre‑approve checklists by risk tier; require exceptions only for out‑of‑tier asks
- Track days added per control; prune or defer low‑ROI items
- Revisit tiers post‑pilot; tighten where signal warrants
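One way to keep security drag explicit, with illustrative controls and day estimates: express the tiers as data, so pruning or deferring a control is a diff rather than a debate.

```python
# Sketch only: control names and day estimates are illustrative assumptions.
SECURITY_TIERS = {
    "P0": [("no keys on laptops", 0.5), ("audited secrets store", 1.0)],
    "P1": [("basic CI checks (SAST, dependency scan)", 1.0)],
    "P2": [("network segmentation", 3.0), ("full threat-model review", 2.0)],
}
MVP_TIERS = {"P0", "P1"}  # ship these now; stage P2 hardening after validation

def drag_in_days(tiers: set[str]) -> float:
    """Total delivery days added by the selected control tiers."""
    return sum(days for tier in tiers for _, days in SECURITY_TIERS[tier])
```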
From Autocomplete to Code Agents
Developer flow is shifting from tab‑complete to task‑first agents. The chat/task surface plans the work; the IDE executes it and produces diffs, tests, and verification notes. Net effect: higher throughput with tighter review gates.
- Standardize an agentic IDE output: plan, diffs, tests, verification log
- Gate merges on agent + human reviews; require runnable repros for fixes
- Track agent share of work and defect escape rate to prove ROI
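A possible shape for the standardized output, with assumed field names: plan, diffs, tests, and a verification log, plus a merge gate that insists on a runnable repro for bug fixes.

```python
# Sketch only: the work-product schema and gate rules are assumptions.
from dataclasses import dataclass

@dataclass
class AgentWorkProduct:
    plan: str                    # what the agent intends to change and why
    diffs: list[str]             # unified diffs, one per file
    tests: list[str]             # new or updated test files
    verification_log: str        # commands run and their output
    repro: str | None = None     # runnable repro, required for bug fixes

def ready_for_review(work: AgentWorkProduct, is_bug_fix: bool) -> bool:
    """Merge gate: plan, tests, and verification present; repro required for fixes."""
    if is_bug_fix and not work.repro:
        return False
    return bool(work.plan and work.tests and work.verification_log)
```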
Default to Prompt Changes Before Adding Code
In agentic RPA and extraction, code sprawl breeds fragility. Freeze schemas, and tune instructions, scope, and retries before adding helpers or flows. Require propose → approve → implement, with trivial undo paths.
- Enforce reviewed diffs; no merges without approval
- Prioritize instruction edits and selector wording over new code
- Maintain a one‑command rollback; document it in the repo
Anchor on Headers, Scope the DOM, Then Extract
Reliability jumps when you narrow the search surface. Find a human‑readable section header, climb to the nearest parent container, and restrict extraction to that subtree. Prefer semantic locators with explicit fallbacks and per‑step retries.
- “Find header → closest parent container → limit queries to subtree”
- Write instructions specific enough to exclude adjacent sections
- If selector confidence < threshold, log and retry next‑best container
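A sketch of the header-anchored pattern using BeautifulSoup, with an assumed header pattern and container tags: find the header text, climb to the nearest parent container, and extract only within that subtree.

```python
# Sketch only: the header pattern, container tags, and table-row extraction
# are illustrative assumptions.
import re
from bs4 import BeautifulSoup

def extract_section(html: str, header_pattern: str) -> dict[str, str]:
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find(string=re.compile(header_pattern, re.I))
    if header is None:
        raise LookupError(f"header matching {header_pattern!r} not found")
    # climb to the nearest container so adjacent sections are excluded
    container = header.find_parent(["section", "div", "table"]) or header.parent
    # limit all further queries to that subtree
    rows: dict[str, str] = {}
    for row in container.find_all("tr"):
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) == 2:
            rows[cells[0]] = cells[1]
    return rows
```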
Keep the Main Flow Clean to Teach and Scale
Treat the repo as a learning surface. The primary function should read end‑to‑end: hoist schemas and constants up, keep a single main flow in the middle, and push helpers to the bottom. Minimize helper surface and pass typed inputs/outputs.
- Top: schema/constants; Middle: single main flow; Bottom: helpers
- Pass typed inputs; return typed outputs; keep helpers narrow
- Add brief docstrings with example success and failure outputs
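A skeleton of that layout with an illustrative domain: schema and constants at the top, one readable main flow in the middle, narrow typed helpers at the bottom.

```python
# Sketch only: the submission domain and field names are assumptions.
from dataclasses import dataclass

# --- Top: schema and constants --------------------------------------------
MAX_RETRIES = 3

@dataclass
class Submission:
    submission_id: str
    applicant_name: str
    amount: float

# --- Middle: single main flow, readable end to end -------------------------
def process(raw: dict) -> Submission:
    """Parse, validate, and return one normalized submission."""
    parsed = _parse(raw)
    _validate(parsed)
    return parsed

# --- Bottom: narrow, typed helpers ------------------------------------------
def _parse(raw: dict) -> Submission:
    """Success: Submission(...). Failure: raises KeyError on missing fields."""
    return Submission(raw["id"], raw["name"], float(raw["amount"]))

def _validate(sub: Submission) -> None:
    """Success: returns None. Failure: raises ValueError for non-positive amounts."""
    if sub.amount <= 0:
        raise ValueError(f"amount must be positive: {sub.amount}")
```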