Field Notes - Oct 17, '25

Executive Signals

Agents are the new apps: ship into chat, not standalone surfaces
Tall then small: validate with big models, then distill to cheap
Durability over orchestration: queues, retries, checkpoints beat premature platforms
Vibes before evals: founder runs matter until guardrails are needed
Alarms at night: agents triage first, humans only with blast radius

CEO

Pick Hated, Structured Work First

The fastest adoption comes from automating the repetitive, clearly bounded tasks everyone dreads. Start with queues that have crisp inputs and objective “done” states, keep a human as the final decision-maker, and enforce a kill switch if quality doesn’t exceed a 90% golden-path bar in two sprints.

Ask each function lead for their most‑hated weekly task; rank by volume × clarity
Constrain V1 to one source, one action, one decision gate; human approves
If golden paths miss twice, pivot to a clearer use case
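
The volume × clarity ranking can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the Task fields, the 0-1 clarity scale, and the sample numbers are all made-up assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    weekly_volume: int  # how many times per week the task occurs
    clarity: float      # 0.0-1.0, hypothetical score for crisp inputs and an objective "done" state

def rank_tasks(tasks):
    """Sort candidate tasks by volume x clarity, highest score first."""
    return sorted(tasks, key=lambda t: t.weekly_volume * t.clarity, reverse=True)

# Illustrative candidates only; real values come from the function leads.
candidates = [
    Task("invoice reconciliation", weekly_volume=200, clarity=0.9),
    Task("vendor contract review", weekly_volume=15, clarity=0.4),
    Task("support ticket tagging", weekly_volume=500, clarity=0.7),
]
ranked = rank_tasks(candidates)
for t in ranked:
    print(f"{t.name}: {t.weekly_volume * t.clarity:.0f}")
```

High volume with low clarity still ranks low, which matches the advice to start where "done" is objective.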

Product

Ship Agents Where People Already Work

Agents stick when they live inside chat, email, or ticketing, not in a new app. Design for strict platform deadlines and ephemeral conversations: acknowledge immediately, continue asynchronously, stream partials for risky steps, and log every tool call so sessions can resume after crashes.

Always reply before platform timeouts; queue follow‑ups in background
Stream intermediate results and request confirmation on destructive actions
Persist tool calls and state to enable resumable workflows
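
A rough sketch of the acknowledge-then-continue pattern using asyncio. The tool name, the session id, and the in-memory TOOL_LOG are assumptions for illustration; a real deployment would persist to a durable store and hand work to a queue.

```python
import asyncio
import json

TOOL_LOG = []  # stand-in for a durable store (assumption); use a database in practice

async def run_tool(session_id, name, args):
    """Run a hypothetical tool and persist the call so the session can be resumed."""
    await asyncio.sleep(0.01)  # simulate slow external work
    result = {"tool": name, "args": args, "ok": True}
    TOOL_LOG.append(json.dumps({"session": session_id, **result}))
    return result

async def handle_message(session_id, text, background):
    """Acknowledge right away; the slow work continues in a background task."""
    task = asyncio.create_task(run_tool(session_id, "lookup", {"query": text}))
    background.add(task)                        # hold a reference so the task is not dropped
    task.add_done_callback(background.discard)  # clean up once it finishes
    return "Got it, working on it."

async def main():
    background = set()
    ack = await handle_message("s1", "refund status for order 42", background)
    # The ack is available before the tool finishes; here we wait only to end the demo.
    await asyncio.gather(*background)
    return ack, list(TOOL_LOG)

ack, log = asyncio.run(main())
print(ack, len(log))
```

The reply returns before the tool call completes, and the logged call is what a resumed session would replay from.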

Vibes First, Evals Second

Early signal is qualitative: sit with the product and run real tasks until the experience feels right. When usage and contributors grow, move to CI‑backed evals that guard against regressions. Features that can’t pass golden‑path checks within two sprints should be cut or rethought.

Maintain 10–20 golden‑path tasks; demo them daily
Add CI evals once prompts/tools stabilize and there’s >1 builder
Remove or redesign features that fail two consecutive golden‑path runs
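
The "two consecutive failures" rule is easy to automate once golden-path results are recorded per run. A minimal sketch, assuming results are stored as a list of pass/fail booleans per feature (the feature names and history are invented):

```python
def flag_for_redesign(run_history, max_consecutive_failures=2):
    """Return features whose most recent golden-path runs all failed."""
    flagged = []
    for feature, results in run_history.items():
        recent = results[-max_consecutive_failures:]
        # Require a full window of runs so a single early failure does not flag a feature.
        if len(recent) == max_consecutive_failures and not any(recent):
            flagged.append(feature)
    return sorted(flagged)

# Illustrative history: True means the golden-path run passed.
history = {
    "summarize_ticket": [True, True, False, True],
    "auto_refund": [True, False, False],
    "draft_reply": [False],
}
flagged = flag_for_redesign(history)
print(flagged)
```

The same function can run in CI once evals are in place; before that, it works on results logged from the daily demos.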

Engineering

Keep Orchestration Boring and Durable

Most failures come from platformizing too early. Favor a single service, single queue, and single datastore with durability patterns over multi‑agent schedulers. Implement sagas, idempotent tool endpoints, and at‑least‑once retries so work survives crashes and duplicates without surprises.

Enforce a 30‑day “no platform” rule; one service/queue/datastore
Checkpoint after every external call; recover with saga steps
Make retries idempotent from day one
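
A minimal sketch of checkpointed, idempotent retries. The idempotency key format, the dict used as a checkpoint store, and the flaky_charge function are hypothetical; in practice the store must be durable and the downstream endpoint should honor the same key.

```python
import time

def run_step(store, key, action, max_attempts=3, base_delay=0.01):
    """Retry `action` on transient errors; a replay with the same key returns the saved result."""
    if key in store:                 # already completed: skip the side effect
        return store[key]
    for attempt in range(1, max_attempts + 1):
        try:
            result = action()
            store[key] = result      # checkpoint right after the external call
            return result
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

calls = {"count": 0}

def flaky_charge():
    """Hypothetical external call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return {"charged": True}

store = {}  # stand-in for the single datastore
first = run_step(store, "order-42:charge", flaky_charge)
second = run_step(store, "order-42:charge", flaky_charge)  # duplicate delivery from the queue
print(first, second, calls["count"])
```

With at-least-once delivery the second message is expected, and the checkpoint keeps it from charging twice.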

Tool Protocols vs. Native Tools

Use a standard tool protocol only when third parties will build tools or you’re extending another agent. If you control both sides, native integrations win for latency, security, and simplicity. Prototype with a protocol if it speeds learning, but leave a shim for migration.

Choose native when security is strict, latency tight, schemas predictable
Keep a swappable shim to move between protocol and native
Audit tool permissions quarterly; least privilege by default
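
One way to keep the shim swappable is a small interface that both paths implement. This is a sketch under assumptions: the Tool interface, the request shape, and fake_transport are invented for illustration and do not reflect any specific tool protocol.

```python
from typing import Any, Protocol

class Tool(Protocol):
    def call(self, name: str, args: dict[str, Any]) -> dict[str, Any]: ...

class NativeTool:
    """Calls in-process functions directly."""
    def __init__(self, functions):
        self._functions = functions

    def call(self, name, args):
        return self._functions[name](**args)

class ProtocolTool:
    """Forwards calls through a transport callable (hypothetical) that speaks a tool protocol."""
    def __init__(self, transport):
        self._transport = transport

    def call(self, name, args):
        return self._transport({"tool": name, "arguments": args})

def lookup_order(order_id):
    return {"order_id": order_id, "status": "shipped"}

def fake_transport(request):
    # Stand-in for a real protocol client; dispatches locally for the demo.
    return lookup_order(**request["arguments"])

def agent_step(tool: Tool):
    """Agent code depends only on the Tool interface, so the backend can be swapped."""
    return tool.call("lookup_order", {"order_id": "A1"})

native_result = agent_step(NativeTool({"lookup_order": lookup_order}))
protocol_result = agent_step(ProtocolTool(fake_transport))
print(native_result == protocol_result)
```

Migration then means changing which class is constructed, not rewriting the agent.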

Model Strategy: Prove With Tall, Run With Small

When an agent underperforms, first validate the ceiling with a slower, higher‑reasoning model. If that works, distill to a smaller, cheaper model while preserving behavior. Keep prompts and tools model‑agnostic, and set SLOs for cost, latency, and success rate before and after each swap.

Maintain “thinky” (quality) and “fast” (default) model paths
Version prompts with model‑specific tests in a registry
Track cost/task, p50/p95 latency, and success rate across swaps
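
Comparing the two model paths before a swap could look like the sketch below. The run records, the 2-point success tolerance, and the swap_allowed gate are assumptions, and the numbers are illustrative rather than measured.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

def summarize(runs):
    latencies = [r["latency_s"] for r in runs]
    return {
        "cost_per_task": sum(r["cost"] for r in runs) / len(runs),
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "success_rate": sum(r["ok"] for r in runs) / len(runs),
    }

def swap_allowed(before, after, max_success_drop=0.02):
    """Hypothetical SLO gate: the cheaper path must stay within tolerance on success."""
    return (after["success_rate"] >= before["success_rate"] - max_success_drop
            and after["cost_per_task"] <= before["cost_per_task"])

# Made-up runs of the same golden-path tasks on each model path.
thinky_runs = [
    {"cost": 0.40, "latency_s": 12.0, "ok": True},
    {"cost": 0.40, "latency_s": 15.0, "ok": True},
    {"cost": 0.40, "latency_s": 9.0, "ok": True},
    {"cost": 0.40, "latency_s": 20.0, "ok": True},
]
fast_runs = [
    {"cost": 0.05, "latency_s": 2.0, "ok": True},
    {"cost": 0.05, "latency_s": 3.0, "ok": True},
    {"cost": 0.05, "latency_s": 1.5, "ok": True},
    {"cost": 0.05, "latency_s": 4.0, "ok": False},
]
before, after = summarize(thinky_runs), summarize(fast_runs)
print(after, swap_allowed(before, after))
```

In this invented data the fast path is cheaper and quicker but misses the success tolerance, so the swap is held back until distillation closes the gap.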

Alarms at Night: Agents Triage First

Route anomalies to an agent first to gather context, correlate signals, and propose hypotheses with links and repro steps. Only escalate to humans when impact is clear. Tune thresholds for the agent’s time, not a person’s sleep, and continuously measure suppressed false positives.

All alerts → agent triage → human only with repro and blast radius
Require “can wait” vs. “wake now” labels with justification
Review suppressed false positives monthly and tighten rules
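
The escalation gate can be expressed as a small check on the agent's triage output. A sketch only: the Triage fields and the example alerts are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Triage:
    alert_id: str
    repro_steps: list = field(default_factory=list)
    blast_radius: str = ""    # empty if the agent could not establish impact
    label: str = "can wait"   # "can wait" or "wake now"
    justification: str = ""

def should_page_human(t: Triage) -> bool:
    """Page only when the agent supplied repro, blast radius, and a justified 'wake now'."""
    return (t.label == "wake now"
            and bool(t.repro_steps)
            and bool(t.blast_radius)
            and bool(t.justification))

noisy = Triage("a1", label="can wait", justification="single 500 on a retried request")
outage = Triage(
    "a2",
    repro_steps=["curl /checkout returns 503"],
    blast_radius="checkout failing for all regions",
    label="wake now",
    justification="revenue path down",
)
unproven = Triage("a3", label="wake now", justification="looks bad")

pages = [should_page_human(t) for t in (noisy, outage, unproven)]
print(pages)
```

Alerts that fail the gate are the ones to log as suppressed and sample in the monthly review.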

Making a Broken Fixer Agent Useful

Before giving up on a code/error‑fixing agent, raise its thinking ceiling and feed it better examples. Combine higher‑reasoning runs with retrieval of similar diffs, exemplar prompts, and plan→act→verify scaffolding that executes unit tests. If tall‑model + exemplars still fail within 24–48 hours, change the problem.

Build a retrieval corpus of past fixes (diff, error, tests) as a tool
Add structured plan→act→verify with automated test execution
Time‑box: if no gains in 1–2 days, pick a different target
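
The plan→act→verify scaffold reduces to a loop that feeds test output back into the next attempt. In this sketch the model call is replaced by a stub, and the exemplar list and the add function are invented; a real setup would call a higher-reasoning model and run the project's test suite in a sandbox.

```python
def fix_loop(error, propose_fix, run_tests, examples, max_attempts=3):
    """plan -> act -> verify: propose a fix, run the tests, feed failures back in."""
    feedback = error
    for attempt in range(1, max_attempts + 1):
        candidate = propose_fix(feedback, examples)  # plan + act (a model call in practice)
        passed, output = run_tests(candidate)        # verify
        if passed:
            return {"fixed": True, "attempts": attempt, "candidate": candidate}
        feedback = output                            # the next plan sees the test failure
    return {"fixed": False, "attempts": max_attempts, "candidate": None}

def stub_propose_fix(feedback, examples):
    # Stand-in for a model call: returns buggy code until it sees the test failure.
    if "expected 5" in feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"

def run_unit_tests(source):
    """Execute the candidate and check one assertion; only safe for trusted demo code."""
    namespace = {}
    exec(source, namespace)
    got = namespace["add"](2, 3)
    return (got == 5, f"add(2, 3) returned {got}, expected 5")

exemplars = ["diff: swap '-' for '+' in sum helper"]  # would come from the retrieval corpus
outcome = fix_loop("AssertionError in test_add", stub_propose_fix, run_unit_tests, exemplars)
print(outcome["fixed"], outcome["attempts"])
```

The max_attempts cap plays the same role inside a run that the 1-2 day time-box plays for the project.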

Customer Success

Enterprise Agent Adoption Ladder

A durable rollout pattern: vendor builds the first agent with the customer, the second is customer‑built with vendor shadowing, the third is customer‑owned end‑to‑end. Optimize for capability transfer, not dependency, with explicit graduation criteria and shared internal practices.

Define graduation: owner, runbooks, on‑call, and KPI deltas
Stand up an internal agent guild and shared tool catalog
Measure weekly active users and % of workflow handled by the agent

