
Field Notes - Nov 19, '25
Executive Signals
- Output is the new UI: taste systems and evals outrun chrome tweaks
- Agents are the new APIs: audited handshakes with human approvals beat brittle connectors
- Leaderboard gains, budget pains: model swaps only when users feel it
- Security drag, delivery speed: risk-tiered controls and measured days beat gold-plating
- Milestones as mirrors, not fireworks: demo plus retro, then dated next steps
CEO
Treat Milestones as Demos, Not Launches
Convert milestone days into a 60-minute show-and-tell plus honest retro. Prove the path with a small, real batch and a human-in-the-loop to capture timings and edge cases without pretending it’s ready to scale the same day. Leave with explicit acceptance gates and a dated plan for phase two.
- Push 5–10 low-risk submissions end-to-end; record success rate, latency, handoffs
- Agree acceptance gates for phase two (e.g., >95% first-pass success)
- Replace the day’s standup with the demo/retro; exit with owners and dates
Ride the Model Curve, Not the Hype
Design for “free” gains without replatforming, but hold swaps to a business bar. Put a gateway in front of apps, keep business logic outside frameworks, and review models quarterly. Change only when task success, CSAT, or time-to-complete improves while cost per task stays within target, with pinned fallbacks and failover.
- Run hidden A/Bs; promote only if KPI lift clears a cost ceiling
- Maintain a pinned fallback; auto-failover on SLO breaches
- Budget a 90‑day model review cadence behind a gateway
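A minimal sketch of the gateway idea, assuming a stubbed call_model() and illustrative model names: route to the hidden A/B candidate, fail over to the pinned model, and promote only when task success improves while cost per task stays within the ceiling.

```python
# Sketch only: model names, KPI fields, and call_model() are assumptions,
# not a specific vendor API.
from dataclasses import dataclass

@dataclass
class ModelStats:
    task_success: float   # fraction of tasks completed correctly
    cost_per_task: float  # dollars per completed task

PINNED = "model-a-pinned"      # known-good fallback
CANDIDATE = "model-b-preview"  # hidden A/B candidate

def call_model(model: str, prompt: str) -> str:
    """Placeholder for the real provider call behind the gateway."""
    raise NotImplementedError

def complete(prompt: str) -> str:
    """Route to the candidate, but fail over to the pinned model on errors or SLO breaches."""
    try:
        return call_model(CANDIDATE, prompt)
    except Exception:
        return call_model(PINNED, prompt)  # auto-failover

def should_promote(cand: ModelStats, pinned: ModelStats,
                   min_lift: float = 0.02, cost_ceiling: float = 0.05) -> bool:
    """Promote only if task success improves and cost per task stays in budget."""
    return (cand.task_success - pinned.task_success) >= min_lift \
        and cand.cost_per_task <= cost_ceiling
```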
Small Lockbox Squads Build Durable Platforms
A tight, cross-functional squad moves faster than a matrixed org and leaves reusable assets behind. Their deliverable isn't a one-off feature; it's platform primitives the rest of the company can build on via versioned APIs.
- Charter a 4–6 person squad for a 6‑week pilot
- Deliver: AI gateway, eval service, prompt/config repo, safety filters, retrieval
- Integrate only through versioned APIs to avoid dependency drag
Customer Success
Design Human Handoff and Graceful Degradation on Day One
Assume the agent will get stuck. Define when and how conversations move to people, and make failure states preserve trust. Target low handoff rates, fast response for takeovers, and full context transfer so humans resolve quickly and the agent learns from observation.
- Expose a takeover control; summarize context and intent for humans
- Keep MTTA/MTTR under two minutes for handoffs; track to target
- Aim for <20% handoff at launch, <10% by day 60
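One way to make full context transfer concrete, with assumed field names: a single handoff packet carrying the summary, intent, and transcript, plus timestamps that drive MTTA tracking.

```python
# Sketch of a handoff packet a human agent receives on takeover.
# Field names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Handoff:
    conversation_id: str
    summary: str                  # agent-written recap of the conversation so far
    detected_intent: str          # e.g. "refund_request"
    blocking_reason: str          # why the agent got stuck
    transcript: list[str]         # full message history for the human
    requested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    acknowledged_at: datetime | None = None  # set on human takeover

    def mtta_seconds(self) -> float | None:
        """Time from handoff request to human acknowledgement; None until acknowledged."""
        if self.acknowledged_at is None:
            return None
        return (self.acknowledged_at - self.requested_at).total_seconds()
```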
Product
Output Is the UI
In AI products, response quality is the interface. Treat taste as a system: house-voice rubrics, golden datasets, and continuous side‑by‑side evaluations. Ship the eval harness before features so PMs and engineers raise quality weekly, not quarterly, and block releases on regression.
- Build 100–300 canonical conversations; require ≥90% pass to release
- Run weekly double‑blind evals; track win rate vs last version
- Capture approve/edit annotations in-line; recycle into train/eval sets
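A sketch of the release gate, assuming a placeholder judge() for whatever grader the team actually uses (house-voice rubric, model judge, exact match): run the candidate over the golden set and block any release below the pass threshold.

```python
# Sketch only: judge() and the golden-set record shape are assumptions.
def judge(expected: str, actual: str) -> bool:
    """Placeholder grader; swap in the team's rubric or model judge."""
    raise NotImplementedError

def release_gate(golden_set: list[dict], generate, pass_threshold: float = 0.90) -> bool:
    """Run the candidate over canonical conversations; block release below threshold."""
    passed = sum(
        judge(case["expected"], generate(case["prompt"])) for case in golden_set
    )
    pass_rate = passed / len(golden_set)
    print(f"golden-set pass rate: {pass_rate:.1%}")
    return pass_rate >= pass_threshold
```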
Bounded Agents, Risk-Aware Launches
Marry LLM reasoning with deterministic workflows and validators. Choose the first surface by blast radius: start internal when errors are costly and irreversible, external when outcomes are reversible and guardrailed. Maintain kill switches and memory policies to keep actions safe and recoverable.
- Gate GA on ≥90% task success and <5% unsafe output on a golden set
- Whitelist state‑changing tools with explicit confirmations and validators
- Add a per-surface kill switch for instant rollback
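A sketch of the whitelist idea with assumed tool and surface names: state-changing tools require explicit confirmation, must pass a validator, and honor a per-surface kill switch.

```python
# Sketch only: tool names, the validator rule, and surfaces are illustrative.
KILL_SWITCH = {"billing_surface": False}  # flip to True for instant rollback

STATE_CHANGING_TOOLS = {
    # tool name -> validator over its arguments
    "issue_refund": lambda args: 0 < args.get("amount", 0) <= 100,
}

def run_tool(surface: str, tool: str, args: dict, human_confirmed: bool):
    if KILL_SWITCH.get(surface):
        raise RuntimeError(f"{surface} is disabled by kill switch")
    validator = STATE_CHANGING_TOOLS.get(tool)
    if validator is not None:            # state-changing: gate it
        if not human_confirmed:
            raise PermissionError(f"{tool} requires explicit confirmation")
        if not validator(args):
            raise ValueError(f"{tool} rejected by validator: {args}")
    # ... dispatch to the actual tool implementation here
```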
Two-Track Build: Intelligence and Interface
Stand up a test studio (tools, retrieval, validators, traces) while a separate track builds the minimal UI customers touch. Freeze the v1 UI early and swap brains behind a stable contract. Record every session to debug fast and keep a single compatibility spec for inputs, outputs, and errors.
- Freeze v1 UI after week two; iterate brains behind a stable interface
- Record traces, tool calls, and retrieval snippets for every session
- Define one compatibility spec any agent implementation must satisfy
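One possible shape for the compatibility spec, with illustrative names: typed request and response types, a single error type, and a Protocol every brain implementation must satisfy.

```python
# Sketch only: field and type names are assumptions.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class AgentRequest:
    session_id: str
    user_message: str

@dataclass
class AgentResponse:
    reply: str
    tool_calls: list[dict]         # recorded for tracing and debugging
    retrieval_snippets: list[str]  # what the agent actually saw

class AgentError(Exception):
    """Every implementation maps its failures onto this one error type."""

class AgentBrain(Protocol):
    def respond(self, request: AgentRequest) -> AgentResponse: ...
```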
Agents Outgrow Rigid Integrations
Expect agent-to-agent workflows across vendors to evolve faster than bespoke connectors or rigid abstractions. Use signed handshakes, audit trails, and human approvals at boundaries so you can change tools without rewriting everything.
- Pilot one cross‑vendor handshake with audit and rollback
- Require manager approval before any cross‑entity commit
- Log every tool/action pair for traceability
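A sketch of the boundary check, assuming an HMAC-signed payload and a stand-in secret: verify the handshake signature, require manager approval before any cross-entity commit, and log the tool/action pair.

```python
# Sketch only: the signing scheme, secret handling, and log sink are assumptions.
import hashlib
import hmac
import json
import logging

logger = logging.getLogger("agent_handshake")
SHARED_SECRET = b"rotate-me"  # placeholder; use a real secrets manager

def sign(payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()

def commit_cross_entity(payload: dict, signature: str, manager_approved: bool):
    if not hmac.compare_digest(sign(payload), signature):
        raise PermissionError("handshake signature mismatch")
    if not manager_approved:
        raise PermissionError("cross-entity commit requires manager approval")
    logger.info("tool=%s action=%s", payload.get("tool"), payload.get("action"))
    # ... perform the commit, keeping a rollback record
```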
Week‑One Wins That Pay Back Immediately
Before moonshots, fix legibility, continuity, and cost. Make inputs machine-readable, preserve session context, stream progress, and route work to the cheapest capable model. Aim for measurable cost-per-conversation reduction within a quarter.
- Add robust extraction for names/emails/phones/IDs; confirm back to users
- Persist session context; stream partial results instead of spinners
- Route by task to the cheapest capable model; cache hot prompts
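A sketch of cheapest-capable routing with assumed task categories and model names: route by task, fall back to a mid-tier default, and cache hot prompts.

```python
# Sketch only: the task taxonomy, model names, and call_model() are assumptions.
from functools import lru_cache

ROUTES = {
    "classify": "small-fast-model",
    "extract":  "small-fast-model",
    "draft":    "mid-model",
    "reason":   "frontier-model",
}

def call_model(model: str, prompt: str) -> str:
    """Placeholder for the actual gateway call."""
    raise NotImplementedError

@lru_cache(maxsize=1024)          # cache hot (task, prompt) pairs
def complete(task: str, prompt: str) -> str:
    model = ROUTES.get(task, "mid-model")
    return call_model(model, prompt)
```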
Engineering
Status Taxonomy That Makes Automation Safe
Separate “how it’s processed” from “where it is.” Use an is_automated flag for provenance and a clear status path—New → In Progress → With Compliance—with Automation Failed for hard stops. Time‑in‑status becomes your early warning signal and prevents double work.
- Lock records when In Progress; unlock only on success or failure
- Page if time‑in‑status breaches a threshold; auto‑retry with backoff
- Build a daily exceptions report from Automation Failed for triage
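A sketch of the taxonomy with the statuses named above and an in-memory record: is_automated carries provenance, every transition resets the time-in-status clock, and the record locks while In Progress.

```python
# Sketch only: storage is assumed in-memory; field names follow the note above.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Status(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    WITH_COMPLIANCE = "With Compliance"
    AUTOMATION_FAILED = "Automation Failed"

@dataclass
class Record:
    record_id: str
    is_automated: bool = False           # provenance, not position in the workflow
    status: Status = Status.NEW
    status_since: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    locked: bool = False

    def transition(self, new_status: Status) -> None:
        self.status = new_status
        self.status_since = datetime.now(timezone.utc)
        # lock while automation works the record; unlock on success or failure
        self.locked = new_status == Status.IN_PROGRESS

    def minutes_in_status(self) -> float:
        """Time-in-status, the early-warning signal for paging and retries."""
        return (datetime.now(timezone.utc) - self.status_since).total_seconds() / 60
```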
Right‑Size Security Without Stalling Delivery
Adopt tiered controls to avoid gold-plating. Ship P0–P1 for MVP (no keys on laptops, audited secrets, basic CI checks), and stage P2 hardening after validation. Measure security drag in days so trade‑offs are explicit and adjustable.
- Pre‑approve checklists by risk tier; require exceptions only for out‑of‑tier asks
- Track days added per control; prune or defer low‑ROI items
- Revisit tiers post‑pilot; tighten where signal warrants
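One way to keep security drag explicit, with illustrative controls and day estimates: express the tiers as data, so pruning or deferring a control is a diff rather than a debate.

```python
# Sketch only: control names and day estimates are illustrative assumptions.
SECURITY_TIERS = {
    "P0": [("no keys on laptops", 0.5), ("audited secrets store", 1.0)],
    "P1": [("basic CI checks (SAST, dependency scan)", 1.0)],
    "P2": [("network segmentation", 3.0), ("full threat-model review", 2.0)],
}
MVP_TIERS = {"P0", "P1"}  # ship these now; stage P2 hardening after validation

def drag_in_days(tiers: set[str]) -> float:
    """Total delivery days added by the selected control tiers."""
    return sum(days for tier in tiers for _, days in SECURITY_TIERS[tier])
```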
From Autocomplete to Code Agents
Developer flow is shifting from tab‑complete to task‑first agents. The chat/task surface plans the work; the IDE executes it and produces diffs, tests, and verification notes. Net effect: higher throughput with tighter review gates.
- Standardize an agentic IDE output: plan, diffs, tests, verification log
- Gate merges on agent + human reviews; require runnable repros for fixes
- Track agent share of work and defect escape rate to prove ROI
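A possible shape for the standardized output, with assumed field names: plan, diffs, tests, and a verification log, plus a merge gate that insists on a runnable repro for bug fixes.

```python
# Sketch only: the work-product schema and gate rules are assumptions.
from dataclasses import dataclass

@dataclass
class AgentWorkProduct:
    plan: str                    # what the agent intends to change and why
    diffs: list[str]             # unified diffs, one per file
    tests: list[str]             # new or updated test files
    verification_log: str        # commands run and their output
    repro: str | None = None     # runnable repro, required for bug fixes

def ready_for_review(work: AgentWorkProduct, is_bug_fix: bool) -> bool:
    """Merge gate: plan, tests, and verification present; repro required for fixes."""
    if is_bug_fix and not work.repro:
        return False
    return bool(work.plan and work.tests and work.verification_log)
```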
Default to Prompt Changes Before Adding Code
In agentic RPA and extraction, code sprawl breeds fragility. Freeze schemas, and tune instructions, scope, and retries before adding helpers or flows. Require propose → approve → implement, with trivial undo paths.
- Enforce reviewed diffs; no merges without approval
- Prioritize instruction edits and selector wording over new code
- Maintain a one‑command rollback; document it in the repo
Anchor on Headers, Scope the DOM, Then Extract
Reliability jumps when you narrow the search surface. Find a human‑readable section header, climb to the nearest parent container, and restrict extraction to that subtree. Prefer semantic locators with explicit fallbacks and per‑step retries.
- “Find header → closest parent container → limit queries to subtree”
- Write instructions specific enough to exclude adjacent sections
- If selector confidence < threshold, log and retry next‑best container
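A sketch of the header-anchored pattern using BeautifulSoup, with an assumed header pattern and container tags: find the header text, climb to the nearest parent container, and extract only within that subtree.

```python
# Sketch only: the header pattern, container tags, and table-row extraction
# are illustrative assumptions.
import re
from bs4 import BeautifulSoup

def extract_section(html: str, header_pattern: str) -> dict[str, str]:
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find(string=re.compile(header_pattern, re.I))
    if header is None:
        raise LookupError(f"header matching {header_pattern!r} not found")
    # climb to the nearest container so adjacent sections are excluded
    container = header.find_parent(["section", "div", "table"]) or header.parent
    # limit all further queries to that subtree
    rows: dict[str, str] = {}
    for row in container.find_all("tr"):
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) == 2:
            rows[cells[0]] = cells[1]
    return rows
```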
Keep the Main Flow Clean to Teach and Scale
Treat the repo as a learning surface. The primary function should read end‑to‑end: hoist schemas and constants up, keep a single main flow in the middle, and push helpers to the bottom. Minimize helper surface and pass typed inputs/outputs.
- Top: schema/constants; Middle: single main flow; Bottom: helpers
- Pass typed inputs; return typed outputs; keep helpers narrow
- Add brief docstrings with example success and failure outputs
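A skeleton of that layout with an illustrative domain: schema and constants at the top, one readable main flow in the middle, narrow typed helpers at the bottom.

```python
# Sketch only: the submission domain and field names are assumptions.
from dataclasses import dataclass

# --- Top: schema and constants --------------------------------------------
MAX_RETRIES = 3

@dataclass
class Submission:
    submission_id: str
    applicant_name: str
    amount: float

# --- Middle: single main flow, readable end to end -------------------------
def process(raw: dict) -> Submission:
    """Parse, validate, and return one normalized submission."""
    parsed = _parse(raw)
    _validate(parsed)
    return parsed

# --- Bottom: narrow, typed helpers ------------------------------------------
def _parse(raw: dict) -> Submission:
    """Success: Submission(...). Failure: raises KeyError on missing fields."""
    return Submission(raw["id"], raw["name"], float(raw["amount"]))

def _validate(sub: Submission) -> None:
    """Success: returns None. Failure: raises ValueError for non-positive amounts."""
    if sub.amount <= 0:
        raise ValueError(f"amount must be positive: {sub.amount}")
```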