
Field Notes - Nov 24, '25
Executive Signals
- Events are the new logs: orchestration truth lives in durable job events
- Hardware localness matters: optimize where it runs, not where you code
- Interrupts are product, not bugs: treat banners and modals as versioned dependencies
- Fresh sessions find truth: reset state or E2E tests mislead every release
- Artifacts or it didn’t happen: centralized run evidence shortens postmortems and review cycles
Engineering
Reset State or Your E2E Tests Lie
Manual repros often run in sessions where consent popups were dismissed long ago, so they bypass the real failure path. Start every repro with a clean session so adapters face the exact gates users do. Make this the default locally and in CI so failures are observable, not masked.
- Launch test browsers in incognito with storage and cache cleared
- Add a preflight that detects and dismisses consent banners and popups before flows
- Assert “no popups present” after login; fail the run if present
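A minimal sketch of the preflight and the post-login assertion, assuming a tiny page interface with `count` and `click`. The `Page` protocol, the selectors in `POPUPS`, and `FakePage` are all hypothetical stand-ins for illustration, not any real browser driver's API; a real adapter would map these calls onto whatever driver it uses.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Page(Protocol):
    """Minimal page interface assumed for this sketch; not a real driver API."""

    def count(self, selector: str) -> int: ...
    def click(self, selector: str) -> None: ...


# Hypothetical mapping: popup selector -> selector of the control that closes it.
POPUPS = {
    "#consent-banner": "#consent-banner button.accept",
    ".promo-modal": ".promo-modal .close",
}


def dismiss_popups(page: Page, popups: dict = POPUPS) -> list:
    """Preflight: close every known popup that is present; return what was closed."""
    dismissed = []
    for popup, close_control in popups.items():
        if page.count(popup) > 0:
            page.click(close_control)
            dismissed.append(popup)
    return dismissed


def assert_no_popups(page: Page, popups: dict = POPUPS) -> None:
    """Post-login gate: fail the run if any known popup is still on the page."""
    present = [popup for popup in popups if page.count(popup) > 0]
    if present:
        raise AssertionError(f"popups still present after login: {present}")


@dataclass
class FakePage:
    """In-memory stand-in for a browser page, used only to exercise the sketch."""

    visible: set = field(default_factory=set)

    def count(self, selector: str) -> int:
        return 1 if selector in self.visible else 0

    def click(self, selector: str) -> None:
        # Clicking a close control removes the popup it belongs to.
        for popup, close_control in POPUPS.items():
            if close_control == selector:
                self.visible.discard(popup)
```

Running `dismiss_popups` before a flow and `assert_no_popups` after login keeps the two concerns separate: the first handles what is known, the second turns anything left over into a hard failure instead of a flaky step later on.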
Make the Queue the Source of Truth
If workers complete jobs but the dashboard never records them, you have luck, not orchestration. Treat events as the ledger: every unit of work must emit durable status so visibility, retries, and SLAs rely on facts, not logs.
- Emit start, heartbeat, and terminal events per job; treat missing acks as failures
- Separate broker from result backend; power dashboards from the result store
- Block deploys if telemetry coverage for queued tasks is not 100%
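One way to read "events as the ledger" is that job status is derived only from recorded events, never from worker logs. The sketch below assumes a flat list of events and a 30-second heartbeat timeout; the `JobEvent` shape, the event kind names, and the timeout are illustrative assumptions, not a specific queue library's schema.

```python
from dataclasses import dataclass

TERMINAL_KINDS = {"succeeded", "failed"}


@dataclass(frozen=True)
class JobEvent:
    job_id: str
    kind: str  # "start", "heartbeat", "succeeded", or "failed" (assumed names)
    ts: float  # seconds since epoch


def job_status(events: list, job_id: str, now: float,
               heartbeat_timeout: float = 30.0) -> str:
    """Derive a job's status purely from its recorded events."""
    mine = sorted((e for e in events if e.job_id == job_id), key=lambda e: e.ts)
    if not mine:
        return "unknown"  # queued but never observed: a telemetry gap
    for event in mine:
        if event.kind in TERMINAL_KINDS:
            return event.kind
    # No terminal event yet: a stale last event counts as a missing ack.
    if now - mine[-1].ts > heartbeat_timeout:
        return "failed"
    return "running"


def telemetry_coverage(queued_job_ids: set, events: list) -> float:
    """Fraction of queued jobs that emitted at least one event."""
    if not queued_job_ids:
        return 1.0
    seen = {e.job_id for e in events}
    return len(queued_job_ids & seen) / len(queued_job_ids)
```

A deploy gate could then be as simple as refusing to ship when `telemetry_coverage(...)` is below `1.0` for a sample of recently queued jobs.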
Centralize Run Artifacts Early
Scattered logs and screenshots turn debugging into folklore. Use a shared object store and tie artifacts to job IDs so anyone can reconstruct a run in minutes and postmortems move from guesswork to evidence.
- Standardize paths: org/env/jobID/timestamp/*
- Attach artifact URIs to job records and surface them in the UI
- Set retention by severity; keep failures for 30–90 days
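The path convention and retention rule above can be sketched as two small helpers. The timestamp format, the severity names, and the 7-day value for successful runs are assumptions for illustration; only the `org/env/jobID/timestamp/` layout and the 30 to 90 day range for failures come from the notes.

```python
from datetime import datetime, timedelta, timezone

# Retention in days. The severity names and the "success" value are assumptions;
# failures use the two ends of the 30 to 90 day range.
RETENTION_DAYS = {"success": 7, "failure": 30, "critical_failure": 90}


def artifact_prefix(org: str, env: str, job_id: str, started_at: datetime) -> str:
    """Build the shared object-store prefix org/env/jobID/timestamp/ for one run."""
    if started_at.tzinfo is None:
        raise ValueError("started_at must be timezone-aware")
    # Normalize to UTC so prefixes sort consistently across workers.
    stamp = started_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{org}/{env}/{job_id}/{stamp}/"


def expires_at(started_at: datetime, severity: str) -> datetime:
    """Return when a run's artifacts become eligible for deletion."""
    return started_at + timedelta(days=RETENTION_DAYS[severity])
```

Storing the returned prefix on the job record is what lets the UI link straight to the evidence for a given job ID.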
Optimize Where It Runs, Not Where You Code
Local timings rarely predict server reality. Optimize adapters with server-side benchmarks and guardrails, and only celebrate wins that move production latency, not laptop microbenchmarks.
- Add server performance CI with per-adapter SLAs
- Alert on regressions greater than 10% versus the last green baseline
- Prioritize hot paths; ignore improvements that don’t shift server latency
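The 10% regression rule is simple enough to sketch directly. This assumes per-adapter latencies (for example p95 in milliseconds, measured on the server) are already collected into dictionaries; the function name and data shape are hypothetical.

```python
def find_regressions(baseline: dict, current: dict, threshold: float = 0.10) -> dict:
    """Return adapters whose latency grew by more than `threshold` versus baseline.

    `baseline` holds latencies from the last green run, `current` from this run.
    The result maps adapter name to relative change, e.g. 0.11 for +11%.
    """
    regressed = {}
    for adapter, base in baseline.items():
        now = current.get(adapter)
        if now is None or base <= 0:
            continue  # no comparable measurement for this adapter
        change = (now - base) / base
        if change > threshold:
            regressed[adapter] = change
    return regressed
```

A CI step could fail the build whenever this returns a non-empty result. Note that a change of exactly 10% does not trip the check, matching "greater than 10%".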
Treat Consent/UI Interrupts as First-Class Requirements
OEM and retail sites keep changing their UI with banners and one-off modals. Handle interrupts as a shared dependency, not bespoke fixes, so resilience improves with each new variant.
- Maintain a shared interrupts library (selectors, close actions, timeouts)
- Version it and roll updates across adapters via a single dependency
- Track an interrupt miss rate and drive it toward zero
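A shared interrupts library can start as a versioned list of entries plus one metric. In this sketch the entry fields mirror the bullets above (selector, close action, timeout); the version string, the example entries, and the `click:` action notation are hypothetical.

```python
from dataclasses import dataclass

LIBRARY_VERSION = "1.0.0"  # hypothetical; bump when entries change


@dataclass(frozen=True)
class Interrupt:
    name: str
    selector: str      # how to detect the interrupt
    close_action: str  # how to dismiss it (assumed "click:<selector>" notation)
    timeout_s: float   # how long to wait for it to disappear


# Example entries only; real selectors depend on the sites being automated.
INTERRUPTS = (
    Interrupt("cookie-consent", "#consent-banner", "click:button.accept", 5.0),
    Interrupt("promo-modal", ".promo-modal", "click:.close", 3.0),
)


def miss_rate(observed_selectors: list, library: tuple = INTERRUPTS) -> float:
    """Fraction of observed interrupts that the library has no entry for."""
    if not observed_selectors:
        return 0.0
    known = {entry.selector for entry in library}
    missed = [s for s in observed_selectors if s not in known]
    return len(missed) / len(observed_selectors)
```

Each miss points at a selector worth adding in the next library version, which is how the rate gets driven toward zero across all adapters at once.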