
Field Notes - Dec 18, '25
Executive Signals
- Failures are the fuel: train on discarded edge cases to unlock durable gains
- Auth is product, not RL: pre-auth scaffolding stabilizes agents and preserves secrets
- Cron is code: rehearse vendor windows, version schedules, and stage timezone chaos
- Hit the window, harden later: run first in lower env, promote next cycle
- Quant beats vibes: require multi-day gates on retries, failures, and SLO adherence
- Fixtures beat prod: containerized junk UIs create scalable, reproducible training signal
Product
Build Failure-First Gyms
Open-source harnesses drop the messy 10–15% where agents truly fail. That’s the learning gold. Construct containerized fixtures that mirror real UI debt and encode antipatterns that break agents, then scale runs without touching production.
- Label each task by failure pattern; version tasks, never silently change them
- Include date pickers, focus traps, lazy loads, re-auth on back, infinite scroll
- Report improvements against a baseline trained only on these failures
Two Modes, One Report
Serve both frontier labs and practitioners by shipping identical tasks in two modes: raw computer use (no helpers) and annotated (stable IDs, element hints). Publish the pass-rate and cost delta so buyers see reliability while labs get closest-to-metal traces.
- Capture mouse/keyboard/vision for raw mode; stabilize IDs for annotated mode
- Attribute failures by mode; highlight where annotations mask model gaps
- Keep artifacts identical so comparisons are credibly apples-to-apples
Prompts: Realistic vs Tuned
Bundle two instructions per task: a short “human-realistic” prompt and a capped “tuned” prompt. Score both and report the spread as prompt sensitivity to reveal crutches versus competence.
- Cap tuned prompts to ≤10 explicit instructions; no hidden chain-of-thought
- Flag tasks where tuned beats realistic by >25 points as not training-grade yet
- Track cost-to-quality for each prompt style
Auth as Product Surface
Identity flows sink agents more than logic does. Treat sessions, 2FA, and expiry as scaffolding outside the model loop. Pre-auth sessions, deterministic 2FA stubs, and a vault for secrets turn brittle runs into reproducible evaluations.
- Inject secrets at the harness layer; never expose them to the model
- Test forced re-login and expired sessions; score “session resilience” separately
- Isolate auth failures from task-logic scores to target fixes precisely
Close the Loop with Ground Truth
After submission, force extraction of confirmations (IDs, timestamps) and compare to the spec. Penalize OCR guesses on high-entropy strings and fail on any single-character mismatch for critical keys.
- Track instruction spec, submitted payload, and scraped confirmation as three artifacts
- Provide a literal copy/paste tool to reduce transcription errors
- Separate “field copy accuracy” from task completion in reporting
Hill-Climbable Pass Bands
Design 6–12 step, judgment-heavy workflows that land in a 40–60% pass band. Too easy lacks gradient; too hard lacks signal. Predefine rubrics and use multi-adjudicated LLM-as-judge for stability.
- Keep horizons ≤30 minutes of interaction per task
- Reject tasks without deterministic scoring criteria
- Use 3+ adjudications per judgment for variance control
Engineering
Fixed Windows, Clean Production Hardening
For monthly or quarterly jobs with immovable run windows, hit the date in a controlled lower environment, then promote for the next cycle. Define quantitative promotion gates so speed doesn’t pollute production.
- Require ≥3 consecutive days at 0 critical failures; retries <2% within SLO
- Capture a runbook during lower-env execution; promote only what’s documented
- Execute a full dress rehearsal with prod-like data, perms, and observability
Environment Standup Is Wiring Work
Compute is fast; configuration is the bottleneck. Environment variables, secrets, object stores, IAM, vendor configs, and data plumbing dominate timelines. Budget accordingly and define “ready” by end-to-end proof, not Terraform success.
- Budget 2–3× the “infra up” time for config and E2E testing
- Ship a preflight checklist per environment (buckets, secrets, queues, callbacks, quotas)
- Define ready as one full E2E job under load with rollback proven
Schedules Are Code
Vendor cron timing, time zones, and partner batch windows are failure magnets. Version control schedules, rehearse them under production-like clocks and volumes, and include DST and month-boundary cases.
- Dry-run each partner schedule in staging; add manual override and freeze alarms
- Version jobs with explicit ownership and rollback scripts
- Monitor schedule drift; alert on missed or overlapping windows
Time-Box Security “Unfixables”
Scanner findings with upstream or dependency blockers aren’t ignorable—treat them as time-boxed exceptions with compensating controls until patched.
- Maintain an exception register with owner, control, and 30–60 day expiry
- Re-scan monthly; auto-close when upstreams patch or graphs change
- If not internet-exposed, enforce image pinning, WAF rules, and least-privileged IAM