
Field Notes - Dec 18, '25
Executive Signals
- Failures are the fuel: train on discarded edge cases to unlock durable gains
- Auth is product, not RL: pre-auth scaffolding stabilizes agents and preserves secrets
- Cron is code: rehearse vendor windows, version schedules, and stage timezone chaos
- Hit the window, harden later: run first in lower env, promote next cycle
- Quant beats vibes: require multi-day gates on retries, failures, and SLO adherence
- Fixtures beat prod: containerized junk UIs create scalable, reproducible training signal
Product
Build Failure-First Gyms
Open-source harnesses drop the messy 10–15% where agents truly fail. That’s the learning gold. Construct containerized fixtures that mirror real UI debt and encode antipatterns that break agents, then scale runs without touching production.
- Label each task by failure pattern; version tasks, never silently change them
- Include date pickers, focus traps, lazy loads, re-auth on back, infinite scroll
- Report improvements against a baseline trained only on these failures
Two Modes, One Report
Serve both frontier labs and practitioners by shipping identical tasks in two modes: raw computer use (no helpers) and annotated (stable IDs, element hints). Publish the pass-rate and cost delta so buyers see reliability while labs get closest-to-metal traces.
- Capture mouse/keyboard/vision for raw mode; stabilize IDs for annotated mode
- Attribute failures by mode; highlight where annotations mask model gaps
- Keep artifacts identical so comparisons are credibly apples-to-apples
Prompts: Realistic vs Tuned
Bundle two instructions per task: a short “human-realistic” prompt and a capped “tuned” prompt. Score both and report the spread as prompt sensitivity to reveal crutches versus competence.
- Cap tuned prompts to ≤10 explicit instructions; no hidden chain-of-thought
- Flag tasks where tuned beats realistic by >25 points as not training-grade yet
- Track cost-to-quality for each prompt style
Auth as Product Surface
Identity flows sink agents more than logic does. Treat sessions, 2FA, and expiry as scaffolding outside the model loop. Pre-auth sessions, deterministic 2FA stubs, and a vault for secrets turn brittle runs into reproducible evaluations.
- Inject secrets at the harness layer; never expose them to the model
- Test forced re-login and expired sessions; score “session resilience” separately
- Isolate auth failures from task-logic scores to target fixes precisely
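The harness-layer injection and deterministic 2FA stub described above, as an added example (not code from these notes). The `{{vault:NAME}}` and `{{totp}}` placeholder syntax, the action dictionary shape, and all sample values are assumptions made for illustration.

```python
import hashlib
import hmac
import re
import struct

# Constructed sample values; a real harness would fetch these from its vault.
VAULT = {"LOGIN_PASSWORD": "s3cr3t-Passw0rd!"}
TOTP_SEED = b"12345678901234567890"
FROZEN_CLOCK = 59  # fixed Unix time, so the 2FA stub returns the same code every run

# Hypothetical placeholder syntax the model emits instead of a real secret.
PLACEHOLDER = re.compile(r"\{\{(totp|vault:([A-Z_]+))\}\}")

def totp(seed: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP (HMAC-SHA1) computed against a caller-supplied clock."""
    counter = struct.pack(">Q", unix_time // step)
    mac = hmac.new(seed, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def resolve_action(action: dict) -> dict:
    """Swap placeholders for real values just before the action is executed."""
    def substitute(match: re.Match) -> str:
        if match.group(1) == "totp":
            return totp(TOTP_SEED, FROZEN_CLOCK)
        return VAULT[match.group(2)]  # unknown name raises KeyError: fail closed
    return {**action, "text": PLACEHOLDER.sub(substitute, action["text"])}

def scrub(observation: str) -> str:
    """Replace secret values in page text with their handles before the model sees it."""
    secrets = {"{{vault:" + name + "}}": value for name, value in VAULT.items()}
    secrets["{{totp}}"] = totp(TOTP_SEED, FROZEN_CLOCK)
    for handle, value in secrets.items():
        observation = observation.replace(value, handle)
    return observation
```

The model only ever reads and writes the handles; the TOTP seed itself is never addressable through a placeholder.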
Close the Loop with Ground Truth
After submission, force extraction of confirmations (IDs, timestamps) and compare to the spec. Penalize OCR guesses on high-entropy strings and fail on any single-character mismatch for critical keys.
- Track instruction spec, submitted payload, and scraped confirmation as three artifacts
- Provide a literal copy/paste tool to reduce transcription errors
- Separate “field copy accuracy” from task completion in reporting
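A sketch of the three-artifact comparison above, as an added example (not code from these notes). The field names, the critical-key set, the entropy threshold, and the sample artifacts are assumptions constructed for illustration.

```python
import math
from collections import Counter

CRITICAL_KEYS = {"account_number"}  # assumed; declared per task in the instruction spec
HIGH_ENTROPY_BITS = 32              # assumed threshold for "high-entropy" strings

def entropy_bits(value: str) -> float:
    """Total Shannon entropy of the string in bits; large for random-looking IDs."""
    counts = Counter(value)
    return -sum(n * math.log2(n / len(value)) for n in counts.values())

def score(spec: dict, submitted: dict, scraped: dict, ocr_fields=frozenset()) -> dict:
    """Compare the three artifacts field by field using exact string equality."""
    mismatches = [key for key, expected in spec.items()
                  if submitted.get(key) != expected or scraped.get(key) != expected]
    # Fields the agent read via OCR that look random are flagged for a penalty.
    ocr_flagged = sorted(key for key in ocr_fields
                         if entropy_bits(str(scraped.get(key, ""))) > HIGH_ENTROPY_BITS)
    return {
        # One wrong character in a critical key fails the check outright.
        "critical_keys_ok": not any(key in CRITICAL_KEYS for key in mismatches),
        # Reported separately from task completion.
        "field_copy_accuracy": (len(spec) - len(mismatches)) / len(spec),
        "mismatches": mismatches,
        "ocr_flagged": ocr_flagged,
    }

# Constructed sample artifacts: the scraped confirmation reads "B" as "8".
SPEC = {"account_number": "ACCT-7Q4X9B2M", "amount": "149.00", "due_date": "2025-12-18"}
SUBMITTED = dict(SPEC)
SCRAPED = {**SPEC, "account_number": "ACCT-7Q4X982M"}
RESULT = score(SPEC, SUBMITTED, SCRAPED, ocr_fields={"account_number", "due_date"})
```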
Hill-Climbable Pass Bands
Design 6–12 step, judgment-heavy workflows that land in a 40–60% pass band. Too easy lacks gradient; too hard lacks signal. Predefine rubrics and use multi-adjudicated LLM-as-judge for stability.
- Keep horizons ≤30 minutes of interaction per task
- Reject tasks without deterministic scoring criteria
- Use 3+ adjudications per judgment for variance control
Engineering
Fixed Windows, Clean Production Hardening
For monthly or quarterly jobs with immovable run windows, hit the date in a controlled lower environment, then promote for the next cycle. Define quantitative promotion gates so speed doesn’t pollute production.
- Require ≥3 consecutive days at 0 critical failures; retries <2% within SLO
- Capture a runbook during lower-env execution; promote only what’s documented
- Execute a full dress rehearsal with prod-like data, perms, and observability
Environment Standup Is Wiring Work
Compute is fast; configuration is the bottleneck. Environment variables, secrets, object stores, IAM, vendor configs, and data plumbing dominate timelines. Budget accordingly and define “ready” by end-to-end proof, not Terraform success.
- Budget 2–3× the “infra up” time for config and E2E testing
- Ship a preflight checklist per environment (buckets, secrets, queues, callbacks, quotas)
- Define ready as one full E2E job under load with rollback proven
Schedules Are Code
Vendor cron timing, time zones, and partner batch windows are failure magnets. Version control schedules, rehearse them under production-like clocks and volumes, and include DST and month-boundary cases.
- Dry-run each partner schedule in staging; add manual override and freeze alarms
- Version jobs with explicit ownership and rollback scripts
- Monitor schedule drift; alert on missed or overlapping windows
Time-Box Security “Unfixables”
Scanner findings with upstream or dependency blockers aren’t ignorable—treat them as time-boxed exceptions with compensating controls until patched.
- Maintain an exception register with owner, control, and 30–60 day expiry
- Re-scan monthly; auto-close when upstreams patch or graphs change
- If not internet-exposed, enforce image pinning, WAF rules, and least-privileged IAM
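One way to audit the exception register on each re-scan, as an added example (not code from these notes). The entry fields, the action names, and the finding identifiers are constructed placeholders, and the 60-day ceiling is read from the 30–60 day expiry range above.

```python
from datetime import date

MAX_EXCEPTION_DAYS = 60  # upper bound of the 30-60 day expiry window

def audit_register(register: list, open_findings: set, today: date) -> dict:
    """Return {finding: action} where action is close, reject, escalate or active."""
    actions = {}
    for entry in register:
        lifetime = (entry["expires"] - entry["opened"]).days
        if entry["finding"] not in open_findings:
            action = "close"     # latest scan no longer reports it: auto-close
        elif not entry.get("owner") or not entry.get("control"):
            action = "reject"    # no named owner or no compensating control
        elif lifetime > MAX_EXCEPTION_DAYS:
            action = "reject"    # exception granted for longer than the time box
        elif today > entry["expires"]:
            action = "escalate"  # time box ran out while the finding is still open
        else:
            action = "active"
        actions[entry["finding"]] = action
    return actions

# Constructed register; finding IDs are placeholders, not real scanner output.
REGISTER = [
    {"finding": "FINDING-001", "owner": "platform", "control": "WAF rule",
     "opened": date(2025, 11, 1), "expires": date(2025, 12, 16)},
    {"finding": "FINDING-002", "owner": "platform", "control": "image pinning",
     "opened": date(2025, 12, 1), "expires": date(2026, 1, 15)},
    {"finding": "FINDING-003", "owner": "data", "control": "least-privilege IAM",
     "opened": date(2025, 12, 1), "expires": date(2026, 1, 15)},
    {"finding": "FINDING-004", "owner": "data", "control": "WAF rule",
     "opened": date(2025, 12, 1), "expires": date(2026, 3, 31)},
    {"finding": "FINDING-005", "owner": "data", "control": "",
     "opened": date(2025, 12, 1), "expires": date(2026, 1, 15)},
]
# Findings still present in the latest (constructed) scan; FINDING-003 was patched.
OPEN_FINDINGS = {"FINDING-001", "FINDING-002", "FINDING-004", "FINDING-005"}
ACTIONS = audit_register(REGISTER, OPEN_FINDINGS, date(2025, 12, 18))
```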