
Field Notes - Nov 20, '25
Executive Signals
- Contexts over processes: durable scale without memory blowups in headless automation
- Latency is capacity: shaved seconds beat new servers under concurrency caps
- Ephemeral workers, smaller blast radius: clearer autoscaling and better spot economics under bursty loads
- Demos as gates: end-to-end runs set contracts, not feature parades
Product
Demo Retro As An Integration Gate
Use a near-term demo retro as an integration checkpoint, not a feature parade. Run the real flow end-to-end to surface failure modes early, then lock the next tranche behind SLAs and schema contracts. The goal is observable readiness: p95 runtime within thresholds, error rates under control, and no manual shims outside the runbook.
- Freeze field and enum names ≥5 business days prior; publish a mapping doc
- Keep 2–3 toggleable test records per integration to validate idempotence and retries
- Define pass/fail: p95 runtime target, <2% errors, zero manual steps
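The pass/fail criteria above can be encoded as a tiny gate check so the retro result is mechanical, not a debate. A minimal sketch; `GateResult` and the 60s p95 budget are illustrative assumptions, not an existing tool:

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    p95_runtime_s: float   # observed p95 end-to-end runtime
    error_rate: float      # fraction of failed runs, 0.0-1.0
    manual_steps: int      # manual shims used outside the runbook

def passes_gate(r: GateResult, p95_budget_s: float = 60.0) -> bool:
    """Pass/fail per the criteria above: p95 within budget, <2% errors, zero manual steps."""
    return (
        r.p95_runtime_s <= p95_budget_s
        and r.error_rate < 0.02
        and r.manual_steps == 0
    )

print(passes_gate(GateResult(p95_runtime_s=48.0, error_rate=0.01, manual_steps=0)))  # True
print(passes_gate(GateResult(p95_runtime_s=48.0, error_rate=0.03, manual_steps=0)))  # False
```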
Engineering
One Browser, Many Contexts For Scale
Multiple headless browser processes per VM exhaust memory and crash under load. Instead, pool a single browser instance and schedule jobs across isolated contexts (lightweight per-session sandboxes, far cheaper than full processes). Drive concurrency from the queue, enforce strict per-job timeouts, and aggressively recycle contexts to keep RSS predictable.
- Cap concurrent contexts to ~1–2 per vCPU; tune down if p95 RSS climbs
- Recreate a context after N tasks or any crash; capture crash/timeout telemetry
- Disable GPU and extensions; enforce queue backpressure over fire-and-forget
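The pooling pattern above, queue-driven concurrency with recycle-after-N and recycle-on-crash, can be sketched with plain asyncio. `FakeContext` is a hypothetical stand-in for a real browser context (e.g. Playwright's `BrowserContext`); the bounded queue supplies backpressure instead of fire-and-forget:

```python
import asyncio

class FakeContext:
    """Hypothetical stand-in for a real browser context; tracks task count for recycling."""
    def __init__(self):
        self.tasks_run = 0
    async def run(self, job):
        self.tasks_run += 1
        return f"done:{job}"
    async def close(self):
        pass  # a real context would release its pages and memory here

async def worker(queue, results, recycle_after=3):
    ctx = FakeContext()
    try:
        while True:
            job = await queue.get()
            if job is None:                      # sentinel: this worker is done
                queue.task_done()
                return
            try:
                # strict per-job timeout; real code would emit crash/timeout telemetry
                results.append(await asyncio.wait_for(ctx.run(job), timeout=10))
            except Exception:
                await ctx.close()                # recycle on any crash or timeout
                ctx = FakeContext()
            finally:
                queue.task_done()
            if ctx.tasks_run >= recycle_after:   # recycle after N tasks to keep RSS flat
                await ctx.close()
                ctx = FakeContext()
    finally:
        await ctx.close()

async def main(jobs, n_workers=2):
    queue, results = asyncio.Queue(maxsize=8), []   # bounded queue = backpressure
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(n_workers)]
    for j in jobs:
        await queue.put(j)
    for _ in workers:
        await queue.put(None)
    await asyncio.gather(*workers)
    return results

print(asyncio.run(main(range(7))))  # all 7 jobs processed across 2 pooled workers
```

In production the worker count would come from the ~1-2 contexts per vCPU cap, and `FakeContext` would be replaced by a real context created from the shared browser instance.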
Cut Step Time Before Adding Servers
When infra concurrency is capped, cutting latency compounds into capacity faster than adding hardware. Dropping a key search step from ~30s+ to ~5s yielded ~6x step throughput. Apply the same pass across adapters before scaling machines.
- Set per-step budgets (target p95 ≤ 8s; end-to-end ≤ 60s) and fail fast on regressions
- Cache hot selectors/results, remove duplicate queries, and pre-warm sessions
- Track jobs/hour = concurrency × 3600 ÷ avg_job_seconds; fix the top offender first
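The throughput formula above makes the ~6x claim easy to check. The 10-slot concurrency cap below is an illustrative assumption, and the arithmetic assumes the search step dominates the job:

```python
def jobs_per_hour(concurrency: int, avg_job_seconds: float) -> float:
    """jobs/hour = concurrency x 3600 / avg_job_seconds (formula from the note above)."""
    return concurrency * 3600 / avg_job_seconds

# Same concurrency cap, before and after cutting the ~30s search step to ~5s.
before = jobs_per_hour(10, 30)   # 1200.0
after = jobs_per_hour(10, 5)     # 7200.0
print(after / before)            # 6.0 -- the ~6x step-throughput gain
```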
Isolate Headless Jobs With Ephemeral Workers
A single shared VM creates a noisy-neighbor failure domain. Move headless work to ephemeral workers (containers or spot instances) behind a queue. You gain fault isolation, simpler autoscaling, and better unit economics for bursty workloads.
- Scale workers 0→N on backlog and CPU; hard-cap per-worker contexts and memory
- Use preemptible/spot only with idempotent jobs and checkpointed progress
- Kill-and-replace any worker breaching time/memory limits; alert if p95 errors >2%
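The kill-and-replace rule above reduces to a small reconciliation loop. `Worker`, the 300s time budget, and the 1024 MB cap are hypothetical placeholders; a real supervisor would read these from the orchestrator and requeue the interrupted job (which is why jobs must be idempotent and checkpointed):

```python
from dataclasses import dataclass

@dataclass
class Worker:
    """Hypothetical stand-in for an ephemeral container/spot worker."""
    started_at: float   # epoch seconds
    rss_mb: float       # current resident set size

def breaches_limits(w: Worker, now: float, max_seconds: float = 300, max_rss_mb: float = 1024) -> bool:
    """Kill-and-replace criterion: over the time budget OR over the memory cap."""
    return (now - w.started_at) > max_seconds or w.rss_mb > max_rss_mb

def reconcile(workers, now, spawn):
    """Replace any worker breaching limits; keep the fleet size constant."""
    return [spawn() if breaches_limits(w, now) else w for w in workers]

fleet = [Worker(started_at=0, rss_mb=512), Worker(started_at=0, rss_mb=2048)]
fleet = reconcile(fleet, now=60, spawn=lambda: Worker(started_at=60, rss_mb=0))
print([w.rss_mb for w in fleet])  # [512, 0] -- the 2048 MB worker was replaced
```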