
Field Notes - Nov 21, '25
Executive Signals
- Cattle, not pets: disposable workers with global queues survive autoscaling churn
- Concurrency before capacity: per-worker caps tune stability before horizontal spend
- Cameras over logs: video and traces collapse debugging time in automation
- State machines, not timestamps: explicit enums pin down truth and keep reporting clean
- Investor polish last: ship live ops first, timebox design sweeps before demos
- Scope freeze beats sprawl: minimal viable flows de-risk near-term demos and dates
CEO
Investor-Grade Internals, Done Last
Polished internal dashboards signal operational excellence during diligence, but they belong after functionality hardens. Standardize a light design system so every internal tool feels coherent, then show live ops over static slides.
- Reserve a 1–2 day design sweep before investor or exec demos
- Create a reusable internal UI template and apply it across ops tools
- Prefer live operational dashboards over slides in high-stakes demos
Freeze Scope; Demo Core Plumbing First
Lock the near-term demo to the smallest credible flow. Sequence the plumbing: external storage first, global queue second, CRM wiring after. Push status checks and AI triage to Phase 2 once the pipes are proven, and use micro-deadlines to de-risk the date.
- Publish non-negotiable demo criteria and cut anything else
- Set 2–3 micro-deadlines this week to protect the milestone
- Pre-create Phase 2 tickets to accelerate the handoff
Product
Explicit States Beat Timestamp Hacks
Model truth with booleans and enumerated states rather than inferring it from timestamps. Enforce allowed transitions and write each update atomically. Track automated vs human submission to keep reporting clean, and return failure reason codes for later triage.
- Define an explicit state machine and enforce legal transitions
- Add a submitted_by_automation flag and segment reporting on it
- Write back failure reasons with structured codes for triage
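A minimal sketch of the bullets above, assuming four illustrative states and three reason codes. The names JobState, FailureReason, ALLOWED, and transition are hypothetical; only submitted_by_automation comes from these notes. The in-memory version validates everything before mutating anything; a real system would presumably make the same write inside one database transaction.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobState(Enum):  # hypothetical states, for illustration only
    QUEUED = "queued"
    RUNNING = "running"
    SUBMITTED = "submitted"
    FAILED = "failed"


class FailureReason(Enum):  # hypothetical structured reason codes
    TIMEOUT = "timeout"
    VALIDATION_ERROR = "validation_error"
    UNKNOWN = "unknown"


# Legal transitions: anything not listed here is rejected.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUBMITTED, JobState.FAILED},
    JobState.FAILED: {JobState.QUEUED},  # retry
    JobState.SUBMITTED: set(),           # terminal
}


@dataclass
class Job:
    job_id: str
    state: JobState = JobState.QUEUED
    submitted_by_automation: bool = False
    failure_reason: Optional[FailureReason] = None


def transition(job, new_state, *, by_automation=False, failure_reason=None):
    """Validate first, then apply every field change together."""
    if new_state not in ALLOWED[job.state]:
        raise ValueError(f"illegal transition {job.state.name} -> {new_state.name}")
    if (new_state is JobState.FAILED) != (failure_reason is not None):
        raise ValueError("a failure reason is required for FAILED and only for FAILED")
    job.state = new_state
    job.failure_reason = failure_reason
    if new_state is JobState.SUBMITTED:
        job.submitted_by_automation = by_automation
    return job


job = Job(job_id="job-1")
transition(job, JobState.RUNNING)
transition(job, JobState.SUBMITTED, by_automation=True)
```

Reporting can then group on state and submitted_by_automation directly instead of comparing timestamps.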
Engineering
Centralize State; Keep Pods Disposable
Autoscaling broke when queues and artifacts lived on individual pods. Treat pods as cattle: stateless workers that pull from a global queue and can die without losing history. Persist artifacts to object storage and surface job metadata from a database keyed by job_id.
- Use one global queue; workers pull jobs rather than having jobs pushed to them
- Persist logs, screenshots, and videos to object storage; dashboards read storage or DB
- Emit job metadata to a DB keyed by job_id for universal visibility
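The pull model above can be sketched with in-memory stand-ins: queue.Queue plays the global queue, a dict plays object storage, and an in-memory SQLite table plays the metadata DB. All of these, plus the names make_db and run_worker and the key layout jobs/<job_id>/run.log, are assumptions for illustration and not the stack described in these notes.

```python
import queue
import sqlite3


def make_db():
    """Metadata store keyed by job_id (in-memory SQLite as a stand-in)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT, "
        "worker TEXT, log_key TEXT)"
    )
    return db


def run_worker(name, jobs, object_store, db, max_jobs):
    """Pull up to max_jobs from the shared queue; keep nothing on the worker."""
    handled = 0
    while handled < max_jobs:
        try:
            job_id = jobs.get_nowait()
        except queue.Empty:
            break
        # The artifact goes to shared storage, never to local disk.
        log_key = f"jobs/{job_id}/run.log"
        object_store[log_key] = f"{name} processed {job_id}".encode()
        # Metadata goes to the shared DB so any dashboard can find the job.
        with db:
            db.execute(
                "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?)",
                (job_id, "done", name, log_key),
            )
        handled += 1
    return handled


jobs = queue.Queue()
for job_id in ("job-1", "job-2", "job-3"):
    jobs.put(job_id)

store, db = {}, make_db()
run_worker("pod-a", jobs, store, db, max_jobs=1)   # pod-a is then scaled away
run_worker("pod-b", jobs, store, db, max_jobs=10)  # pod-b drains the rest
rows = db.execute("SELECT job_id, worker FROM jobs ORDER BY job_id").fetchall()
```

After pod-a disappears, its job history is still readable from the store and the DB. A production queue would also need acknowledgements or a visibility timeout so that a job in flight on a dying pod is redelivered; this sketch leaves that out.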
Concurrency Caps Before Horizontal Scale
Stability returned after moving to one browser process with multiple contexts and then tuning concurrency. Right-size per-worker job caps first and autoscale second. Scale on queue depth and p95 job time, not CPU alone, and add guardrails so a burst of jobs does not trigger runaway scaling. Isolate sessions to prevent cookie bleed.
- Start low, about 2–3 jobs per worker, and ratchet until errors or latency inflect
- Autoscale on queue depth and p95 job time with protective guardrails
- Isolate context storage when session bleed creates flaky behavior
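One way to turn queue depth and p95 job time into a worker count, as a rough sketch. The formula, the function names p95 and desired_workers, and every default (cap of 3, a 300-second drain target, bounds of 1 to 20 workers, a step limit of 2) are assumptions chosen for illustration, not values from these notes.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


def desired_workers(queue_depth, recent_job_seconds, current_workers, *,
                    per_worker_cap=3, target_drain_seconds=300,
                    min_workers=1, max_workers=20, max_step=2):
    """Workers needed to drain the queue within the target window."""
    job_seconds = p95(recent_job_seconds)
    # Queued work in job-seconds, divided by what one worker clears
    # in the target window at its concurrency cap.
    needed = math.ceil(
        queue_depth * job_seconds / (per_worker_cap * target_drain_seconds)
    )
    # Guardrail 1: change by at most max_step per decision, so a burst
    # cannot trigger a scaling storm.
    stepped = max(current_workers - max_step,
                  min(current_workers + max_step, needed))
    # Guardrail 2: hard floor and ceiling on fleet size.
    return max(min_workers, min(max_workers, stepped))


durations = [10] * 18 + [30, 60]           # p95 is 30 seconds
steady = desired_workers(120, durations, current_workers=2)
burst = desired_workers(3000, durations, current_workers=2)
```

With these inputs both calls return 4: the steady case needs exactly 4 workers, and the burst case would ask for 100 but is held to current plus max_step. Tuning per_worker_cap changes the answer before any new worker is bought, which is the point of capping concurrency first.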
Record the Run, Not Just the Logs
Text logs miss race conditions in headless automation. Capture video, screenshots, and step traces per job, store them with a TTL, and link artifacts directly in the dashboard by job_id. Debugging time drops when you can watch the failure and click into the exact step.
- Enable video on failures or sampled runs; capture step-by-step traces
- Store artifacts in object storage with TTL and link by job_id
- Bundle trace and artifacts for one-click reproduction
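A small sketch of the bookkeeping behind the bullets above: keys derived from job_id, a TTL check, and a deterministic rule for which videos to keep. It assumes every run is recorded and the video is kept only for failures or a sampled share of runs. The 14-day TTL, the 5 percent sample, the key layout, and the names Artifact, keep_video, and dashboard_links are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ARTIFACT_TTL = timedelta(days=14)  # assumed retention window


@dataclass(frozen=True)
class Artifact:
    job_id: str
    kind: str            # e.g. "video", "screenshot", "trace"
    created_at: datetime

    @property
    def key(self):
        """Object-storage key; everything for a job shares one prefix."""
        return f"jobs/{self.job_id}/{self.kind}"

    def is_expired(self, now):
        return now >= self.created_at + ARTIFACT_TTL


def keep_video(job_id, failed, sample_percent=5):
    """Keep video for every failure, plus a stable sample of other runs."""
    if failed:
        return True
    digest = hashlib.sha256(job_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # same job, same bucket
    return bucket < sample_percent


def dashboard_links(job_id, artifacts, now):
    """Map artifact kind to storage key for one job, skipping expired items."""
    return {
        a.kind: a.key
        for a in artifacts
        if a.job_id == job_id and not a.is_expired(now)
    }


now = datetime(2025, 11, 21, tzinfo=timezone.utc)
artifacts = [
    Artifact("job-1", "video", now - timedelta(days=1)),
    Artifact("job-1", "trace", now - timedelta(days=30)),  # past TTL
    Artifact("job-2", "video", now),
]
links = dashboard_links("job-1", artifacts, now)
```

Here links holds only the job-1 video, since the trace is past its TTL. In practice the object store's own lifecycle rules would likely do the deletion; the check here only keeps the dashboard from linking to something that is gone.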