
Field Notes - Nov 25, '25
Executive Signals
- Retries are the new SLAs: budget failure paths, show only final fails
- Nightly is the new realtime: batch by default, overrides for urgent exceptions
- Soak before speed: reliability signals trump micro-optimizations until core loops hold
- Value first, environments later: wire CRM loop, then reliability, then polish
- AI skins beat consoles: ship presentable internal dashboards, wire real data later
- LLM assist, not crutch: replay-first; reserve for structured site breakage only
Product
Sequence By Business Impact, Not Environments
Ship the value path first: close the CRM loop, then harden reliability, then worry about environments and polish. Link CRM objects and statuses, layer in retries and an error taxonomy, stand up identical stage/prod queues with proper keys and buckets, then cut over ahead of the billing cycle. Dashboard aesthetics can trail once the core loop is reliably clearing work.
- Connect CRM objects/statuses before UI polish
- Add retry/backoff and error taxonomy, then establish identical stage/prod
- Cut over before month-end; polish dashboards after stability
Engineering
Resilient Queues: Retries, Triage, Idempotency
Don’t scale a brittle pipeline. Budget failures up front with three attempts, exponential backoff, and jitter so humans only see terminal fails. Keep writes idempotent into the CRM. Classify errors (site flake vs. adapter bug) and route accordingly. Cache-and-replay first; use an LLM assist only on structured misses like DOM diffs or pop-ups. Emit a compact failure object (reason, last step, screenshot link) to speed triage.
- Implement 3-attempt exponential backoff with 2–10 minute jitter and idempotent writes
- Taxonomize errors and route flake vs. bug paths distinctly
- Emit a single failure object with artifacts for human review
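A minimal sketch of the retry loop above, assuming Python. The `2–10 minute` jitter window is approximated as a jittered exponential backoff starting at 120s and capped at 600s; `TRANSIENT`, `run_with_retries`, `idempotency_key`, and the `FailureReport` fields are illustrative names, not an existing API.

```python
import hashlib
import random
import time
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 3     # three tries total, then terminal fail
BASE_DELAY_S = 120   # backoff starts at ~2 minutes (assumption)
MAX_DELAY_S = 600    # capped at ~10 minutes (assumption)

# Error taxonomy: site flake is worth retrying; anything else is
# treated as an adapter bug and routed straight to a human.
TRANSIENT = (TimeoutError, ConnectionError)

@dataclass
class FailureReport:
    """Compact failure object handed to humans on terminal fail."""
    case_id: str
    reason: str
    last_step: str
    screenshot_url: Optional[str] = None

def idempotency_key(case_id: str, payload: str) -> str:
    """Stable key so CRM writes can be retried without duplicates."""
    return hashlib.sha256(f"{case_id}:{payload}".encode()).hexdigest()

def run_with_retries(case_id, task, sleep=time.sleep, rng=random.random):
    """Run `task` with jittered exponential backoff on transient errors.

    Returns (result, None) on success, (None, FailureReport) on terminal fail.
    """
    last_exc = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return task(), None
        except TRANSIENT as exc:
            last_exc = exc
            if attempt < MAX_ATTEMPTS:
                # exponential backoff capped at MAX_DELAY_S, plus jitter
                delay = min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)
                sleep(delay * (0.5 + rng()))  # jitter: 50-150% of base delay
        except Exception as exc:
            last_exc = exc
            break  # adapter bug: no point retrying
    report = FailureReport(
        case_id=case_id,
        reason=type(last_exc).__name__,
        last_step=str(last_exc),
    )
    return None, report
```

Injecting `sleep` and `rng` keeps the loop testable without waiting out real backoff windows.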
Nightly Batches As Default, Overrides On Demand
For recurring compliance submissions, schedule an off-hours batch as the primary path. Overnight runs avoid peak contention and create a clean “done by morning” SLA. Keep manual buttons and webhooks as escape hatches for urgent work. Define success as clearing all new cases before business hours; post terminal fails back with artifacts. Allow ops to steer each batch via a prompt or parameter.
- Cron nightly at off-hours; expose “Run now” and webhook triggers
- Define “done by morning” and publish terminal fails with artifacts
- Add batch parameters (e.g., prioritize redaction-sensitive checks)
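The batch shape above could be sketched as follows, assuming Python; the case schema, `run_batch`, `met_morning_sla`, the 8:00 business-open cutoff, and the crontab line are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, time

# crontab (assumption): 0 2 * * * python run_nightly.py
BUSINESS_OPEN = time(8, 0)  # "done by morning" deadline (assumption)

@dataclass
class BatchResult:
    submitted: list = field(default_factory=list)
    terminal_fails: list = field(default_factory=list)

def run_batch(cases, submit, prioritize_tag=None):
    """Drain new cases, prioritized tag first; collect terminal fails.

    `submit` is the per-case submission callable; any exception it raises
    is recorded as a terminal fail to post back with artifacts.
    """
    ordered = sorted(
        cases,
        key=lambda c: 0 if prioritize_tag in c.get("tags", []) else 1,
    )
    result = BatchResult()
    for case in ordered:
        try:
            submit(case)
            result.submitted.append(case["id"])
        except Exception as exc:
            result.terminal_fails.append({"id": case["id"], "reason": str(exc)})
    return result

def met_morning_sla(finished_at: datetime) -> bool:
    """SLA check: the batch cleared before business hours."""
    return finished_at.time() <= BUSINESS_OPEN
```

The same `run_batch` entry point can back the cron job, the "Run now" button, and the webhook, with `prioritize_tag` as the ops steering parameter.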
Soak-Test Early, Optimize Later
Prove reliability before chasing speed. A multi-hour soak across roughly 1k jobs with a low-single-digit fail rate is enough to proceed, and it surfaces missing pieces like pop-up handling and timeouts. Greenlight when pods don’t flap, the fail rate is under ~3%, and p95 per run is under ~5 minutes. Expand adapters only after the soak passes; defer micro-optimizations as long as the morning SLA is met.
- Run a multi-hour soak; watch restarts, fail rate, and p95 latency
- Fix pop-up patterns/timeouts before widening adapter coverage
- Postpone micro-optimizations until SLA is consistently green
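The greenlight gate above reduces to three checks; a sketch, assuming Python, with `soak_greenlight` and `p95` as illustrative names and the thresholds taken from the ~3% / ~5 minute figures:

```python
import math

FAIL_RATE_MAX = 0.03  # "fail rate under ~3%"
P95_MAX_S = 300       # "p95 per run under ~5 minutes"

def p95(durations_s):
    """Nearest-rank 95th percentile of per-run durations (seconds)."""
    ordered = sorted(durations_s)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def soak_greenlight(total, failed, durations_s, pod_restarts):
    """Return (greenlight, per-check breakdown) for a soak run."""
    checks = {
        "stable_pods": pod_restarts == 0,       # pods don't flap
        "fail_rate": failed / total < FAIL_RATE_MAX,
        "p95_latency": p95(durations_s) < P95_MAX_S,
    }
    return all(checks.values()), checks
```

Returning the per-check breakdown alongside the boolean makes it obvious which gate blocked the greenlight.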
Use AI As Your Internal-Tool Designer
Executives judge internal dashboards by look-and-feel. Let AI propose a presentable skin (cards, human-readable dates, OEM labels, embedded screenshots), then pair-program to wire real data. Maintain a server-side “view output” with artifacts to avoid terminal-only debugging. Ship the AI-designed skin after reliability work and stage/prod parity are in place; the win is speed to presentation-grade, not pixel perfection.
- Prompt AI for non-dev UI patterns; iterate to “presentation-grade”
- Keep a server-side artifact viewer for rapid triage
- Ship the skin after reliability and stage/prod are live
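The server-side artifact view can start as a plain render step; a minimal sketch, assuming Python, where `render_failures` and the record fields (`case_id`, `reason`, `last_step`, `ts`, `screenshot_url`) are illustrative, not a fixed schema:

```python
from datetime import datetime
from html import escape

def render_failures(failures):
    """Render terminal-fail records as simple HTML cards:
    human-readable dates plus an embedded screenshot per card."""
    cards = []
    for f in failures:
        # ISO timestamp -> human-readable date for the card
        when = datetime.fromisoformat(f["ts"]).strftime("%b %d, %Y %H:%M")
        cards.append(
            "<div class='card'>"
            f"<h3>{escape(f['case_id'])}</h3>"
            f"<p>{escape(f['reason'])} at step: {escape(f['last_step'])}</p>"
            f"<p><time>{when}</time></p>"
            f"<img src='{escape(f['screenshot_url'])}' alt='last screenshot'>"
            "</div>"
        )
    return "<html><body>" + "".join(cards) + "</body></html>"
```

Serving this string from any endpoint gives reviewers cards with artifacts instead of terminal scrollback; the AI-proposed skin can restyle it later without touching the data path.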