
Field Notes - Nov 25, '25
Executive Signals
- Retries are the new SLAs: budget failure paths, show only final fails
- Nightly is the new realtime: batch by default, overrides for urgent exceptions
- Soak before speed: reliability signals trump micro-optimizations until core loops hold
- Value first, environments later: wire CRM loop, then reliability, then polish
- AI skins beat consoles: ship presentable internal dashboards, wire real data later
- LLM assist, not crutch: replay-first; reserve for structured site breakage only
Product
Sequence By Business Impact, Not Environments
Ship the value path first: close the CRM loop, then harden reliability, then worry about environments and polish. Link CRM objects and statuses, layer in retries and an error taxonomy, stand up identical stage/prod queues with proper keys and buckets, then cut over ahead of the billing cycle. Dashboard aesthetics can trail once the core loop is reliably clearing work.
- Connect CRM objects/statuses before UI polish
- Add retry/backoff and error taxonomy, then establish identical stage/prod
- Cut over before month-end; polish dashboards after stability
Engineering
Resilient Queues: Retries, Triage, Idempotency
Don’t scale a brittle pipeline. Budget failures up front with three attempts, exponential backoff, and jitter so humans only see terminal fails. Keep writes idempotent into the CRM. Classify errors (site flake vs. adapter bug) and route accordingly. Cache-and-replay first; use an LLM assist only on structured misses like DOM diffs or pop-ups. Emit a compact failure object (reason, last step, screenshot link) to speed triage.
- Implement 3-attempt exponential backoff with 2–10 minute jitter and idempotent writes
- Taxonomize errors and route flake vs. bug paths distinctly
- Emit a single failure object with artifacts for human review
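A minimal sketch of the retry loop above, assuming Python. The `2–10 minute` jitter window is approximated as a jittered exponential backoff starting at 120s and capped at 600s; `TRANSIENT`, `run_with_retries`, `idempotency_key`, and the `FailureReport` fields are illustrative names, not an existing API.

```python
import hashlib
import random
import time
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 3     # three tries total, then terminal fail
BASE_DELAY_S = 120   # backoff starts at ~2 minutes (assumption)
MAX_DELAY_S = 600    # capped at ~10 minutes (assumption)

# Error taxonomy: site flake is worth retrying; anything else is
# treated as an adapter bug and routed straight to a human.
TRANSIENT = (TimeoutError, ConnectionError)

@dataclass
class FailureReport:
    """Compact failure object handed to humans on terminal fail."""
    case_id: str
    reason: str
    last_step: str
    screenshot_url: Optional[str] = None

def idempotency_key(case_id: str, payload: str) -> str:
    """Stable key so CRM writes can be retried without duplicates."""
    return hashlib.sha256(f"{case_id}:{payload}".encode()).hexdigest()

def run_with_retries(case_id, task, sleep=time.sleep, rng=random.random):
    """Run `task` with jittered exponential backoff on transient errors.

    Returns (result, None) on success, (None, FailureReport) on terminal fail.
    """
    last_exc = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return task(), None
        except TRANSIENT as exc:
            last_exc = exc
            if attempt < MAX_ATTEMPTS:
                # exponential backoff capped at MAX_DELAY_S, plus jitter
                delay = min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)
                sleep(delay * (0.5 + rng()))  # jitter: 50-150% of base delay
        except Exception as exc:
            last_exc = exc
            break  # adapter bug: no point retrying
    report = FailureReport(
        case_id=case_id,
        reason=type(last_exc).__name__,
        last_step=str(last_exc),
    )
    return None, report
```

Injecting `sleep` and `rng` keeps the loop testable without waiting out real backoff windows.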
Nightly Batches As Default, Overrides On Demand
For recurring compliance submissions, schedule an off-hours batch as the primary path. Overnight runs avoid peak contention and create a clean “done by morning” SLA. Keep manual buttons and webhooks as escape hatches for urgent work. Define success as clearing all new cases before business hours; post terminal fails back with artifacts. Allow ops to steer each batch via a prompt or parameter.
- Cron nightly at off-hours; expose “Run now” and webhook triggers
- Define “done by morning” and publish terminal fails with artifacts
- Add batch parameters (e.g., prioritize redaction-sensitive checks)
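The batch shape above could be sketched as follows, assuming Python; the case schema, `run_batch`, `met_morning_sla`, the 8:00 business-open cutoff, and the crontab line are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, time

# crontab (assumption): 0 2 * * * python run_nightly.py
BUSINESS_OPEN = time(8, 0)  # "done by morning" deadline (assumption)

@dataclass
class BatchResult:
    submitted: list = field(default_factory=list)
    terminal_fails: list = field(default_factory=list)

def run_batch(cases, submit, prioritize_tag=None):
    """Drain new cases, prioritized tag first; collect terminal fails.

    `submit` is the per-case submission callable; any exception it raises
    is recorded as a terminal fail to post back with artifacts.
    """
    ordered = sorted(
        cases,
        key=lambda c: 0 if prioritize_tag in c.get("tags", []) else 1,
    )
    result = BatchResult()
    for case in ordered:
        try:
            submit(case)
            result.submitted.append(case["id"])
        except Exception as exc:
            result.terminal_fails.append({"id": case["id"], "reason": str(exc)})
    return result

def met_morning_sla(finished_at: datetime) -> bool:
    """SLA check: the batch cleared before business hours."""
    return finished_at.time() <= BUSINESS_OPEN
```

The same `run_batch` entry point can back the cron job, the "Run now" button, and the webhook, with `prioritize_tag` as the ops steering parameter.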
Soak-Test Early, Optimize Later
Prove reliability before chasing speed. A multi-hour soak across roughly 1k jobs with a low-single-digit fail rate is enough to proceed, and it surfaces missing pieces like pop-up handling and timeouts. Greenlight when pods don’t flap, the fail rate is under ~3%, and p95 per run is under ~5 minutes. Expand adapters only after the soak passes; defer micro-optimizations as long as the morning SLA is met.
- Run a multi-hour soak; watch restarts, fail rate, and p95 latency
- Fix pop-up patterns/timeouts before widening adapter coverage
- Postpone micro-optimizations until SLA is consistently green
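The greenlight gate above reduces to three checks; a sketch, assuming Python, with `soak_greenlight` and `p95` as illustrative names and the thresholds taken from the ~3% / ~5 minute figures:

```python
import math

FAIL_RATE_MAX = 0.03  # "fail rate under ~3%"
P95_MAX_S = 300       # "p95 per run under ~5 minutes"

def p95(durations_s):
    """Nearest-rank 95th percentile of per-run durations (seconds)."""
    ordered = sorted(durations_s)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def soak_greenlight(total, failed, durations_s, pod_restarts):
    """Return (greenlight, per-check breakdown) for a soak run."""
    checks = {
        "stable_pods": pod_restarts == 0,       # pods don't flap
        "fail_rate": failed / total < FAIL_RATE_MAX,
        "p95_latency": p95(durations_s) < P95_MAX_S,
    }
    return all(checks.values()), checks
```

Returning the per-check breakdown alongside the boolean makes it obvious which gate blocked the greenlight.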
Use AI As Your Internal-Tool Designer
Executives judge internal dashboards by look-and-feel. Let AI propose a presentable skin (cards, human-readable dates, OEM labels, embedded screenshots), then pair-program to wire real data. Maintain a server-side “view output” with artifacts to avoid terminal-only debugging. Ship the AI-designed skin after reliability work and stage/prod parity are in place; the win is speed to presentation-grade, not pixel perfection.
- Prompt AI for non-dev UI patterns; iterate to “presentation-grade”
- Keep a server-side artifact viewer for rapid triage
- Ship the skin after reliability and stage/prod are live
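The server-side artifact view can start as a plain render step; a minimal sketch, assuming Python, where `render_failures` and the record fields (`case_id`, `reason`, `last_step`, `ts`, `screenshot_url`) are illustrative, not a fixed schema:

```python
from datetime import datetime
from html import escape

def render_failures(failures):
    """Render terminal-fail records as simple HTML cards:
    human-readable dates plus an embedded screenshot per card."""
    cards = []
    for f in failures:
        # ISO timestamp -> human-readable date for the card
        when = datetime.fromisoformat(f["ts"]).strftime("%b %d, %Y %H:%M")
        cards.append(
            "<div class='card'>"
            f"<h3>{escape(f['case_id'])}</h3>"
            f"<p>{escape(f['reason'])} at step: {escape(f['last_step'])}</p>"
            f"<p><time>{when}</time></p>"
            f"<img src='{escape(f['screenshot_url'])}' alt='last screenshot'>"
            "</div>"
        )
    return "<html><body>" + "".join(cards) + "</body></html>"
```

Serving this string from any endpoint gives reviewers cards with artifacts instead of terminal scrollback; the AI-proposed skin can restyle it later without touching the data path.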