
Field Notes - Jan 8, '26
Executive Signals
- Batch beats bravado: deployment freezes during runs protect the money hour
- Stateful beats stateless: run retries as one job; logs and idempotency preserve truth
- Calendar is capacity planning: capacity mirrors calendar spikes, not comforting long-run averages
- Portals before parsing: source status where truth lives, email only when forced
- AI for the long tail: triage evidence fast, humans own ambiguous outcomes
Customer Success
Make Upstream Gaps First-Class Issues
Many “failures” are upstream data gaps, not system faults. Classify them as automation-failed with human-readable notes and avoid silent fallbacks. When a portal URL mismatches the CRM, fail and escalate to the issuer instead of selecting “Other,” which risks contaminating defaults. Fix once at the source, then re-run cleanly.
- Standardize a fault taxonomy; auto-write CRM notes with the exact blocker
- Auto-template issuer escalations with evidence and expected URL; track SLAs
- Provide one-click Return to New to re-run clean after data fix
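A minimal sketch of the fail-and-escalate rule, assuming a two-value taxonomy and a host-plus-path URL comparison. The names Fault, Outcome, and check_portal_url are hypothetical, not an existing API, and the real taxonomy would be whatever the team standardizes on.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class Fault(Enum):
    # Hypothetical taxonomy; real categories would come from the team.
    UPSTREAM_DATA_GAP = "upstream_data_gap"
    SYSTEM_FAULT = "system_fault"


@dataclass
class Outcome:
    status: str                     # "ok" or "automation-failed"
    fault: Optional[Fault] = None
    crm_note: str = ""              # human-readable blocker for the CRM record
    escalate_to_issuer: bool = False


def _normalize(url: str) -> tuple:
    """Compare on host and path only, ignoring scheme, host case, and trailing slash."""
    parts = urlsplit(url.strip())
    return (parts.netloc.lower(), parts.path.rstrip("/"))


def check_portal_url(crm_url: str, portal_url: str) -> Outcome:
    """Fail loudly on a mismatch instead of falling back to a default such as "Other"."""
    if _normalize(crm_url) == _normalize(portal_url):
        return Outcome(status="ok")
    note = (
        "Automation failed: portal URL mismatch. "
        f"CRM has {crm_url!r}, portal shows {portal_url!r}. "
        "Escalated to issuer; Return to New after the source is fixed."
    )
    return Outcome("automation-failed", Fault.UPSTREAM_DATA_GAP, note, True)
```

The note text doubles as the evidence for the issuer escalation template, so the blocker is written once and reused.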
Status Tracking: Portal First, Email Last
Pull status from issuer portals wherever available; resort to inbound email parsing only when the issuer mandates it. Standardize on one inbound provider and schema so retries, poison queues, and audits are consistent. Store raw messages and parsed artifacts, and hold a visible status SLO.
- Pick a single inbound provider; define schema, retries, and a poison queue
- Target 95% status updates reflected within 15 minutes of change
- Archive raw messages and parsed artifacts for auditability
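One way the inbound path and the SLO check could look, as a sketch only. The retry budget of 3, the InboundMessage and Store shapes, and the parse callback are assumptions; only the 95% / 15-minute target comes from the notes.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable, Dict, List, Tuple

MAX_ATTEMPTS = 3                      # assumed retry budget
SLO_WINDOW = timedelta(minutes=15)    # target from the notes
SLO_TARGET = 0.95                     # target from the notes


@dataclass
class InboundMessage:
    message_id: str
    raw: str
    attempts: int = 0


@dataclass
class Store:
    # message_id -> (raw message, parsed artifact), kept together for audits
    archive: Dict[str, Tuple[str, dict]] = field(default_factory=dict)
    poison: List[InboundMessage] = field(default_factory=list)


def ingest(msg: InboundMessage, parse: Callable[[str], dict], store: Store) -> bool:
    """Parse with a bounded retry budget; park the message in the poison queue if it never parses."""
    while msg.attempts < MAX_ATTEMPTS:
        msg.attempts += 1
        try:
            parsed = parse(msg.raw)
        except ValueError:
            continue                  # a real system would space retries out
        store.archive[msg.message_id] = (msg.raw, parsed)
        return True
    store.poison.append(msg)          # raw text stays available for review
    return False


def slo_met(lags: List[timedelta]) -> bool:
    """True when at least 95% of status updates landed within 15 minutes of the change."""
    if not lags:
        return True
    on_time = sum(1 for lag in lags if lag <= SLO_WINDOW)
    return on_time / len(lags) >= SLO_TARGET
```

Poisoned messages keep their raw body and attempt count, so an operator can fix the parser and replay them without asking the issuer to resend.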
Engineering
Operate the Plant: Batch Windows and Calendar Bursts
Mid-run redeploys multiplied failures, so freeze deploys during active batch windows and break glass only for genuine infra emergencies. Load spikes concentrate around the 5th, so plan capacity and processes around the calendar, not the average. Provide a scheduler kill switch plus pause/resume so operators can fix issues without compounding failures.
- Enforce no-deploy windows around peaks with tooling and alerts
- Scale workers 3–5x from the 3rd to the 7th; watch queue age and tail latency
- Add scheduler pause/kill and post-run smoke tests before resuming
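The three controls above can be sketched as plain functions and a small state holder. This assumes "the 3rd to the 7th" means days of each calendar month and picks the low end of the 3–5x range; deploy_allowed, worker_target, and Scheduler are hypothetical names, not an existing tool.

```python
from datetime import date

PEAK_START_DAY = 3      # from the notes; assumed to be day of month
PEAK_END_DAY = 7
PEAK_MULTIPLIER = 3     # low end of the 3-5x range


def in_peak_window(today: date) -> bool:
    return PEAK_START_DAY <= today.day <= PEAK_END_DAY


def worker_target(today: date, baseline: int) -> int:
    """Worker count sized for the calendar, not the long-run average."""
    return baseline * PEAK_MULTIPLIER if in_peak_window(today) else baseline


def deploy_allowed(today: date, batch_running: bool, break_glass: bool = False) -> bool:
    """Block deploys during peaks and active batches unless break-glass is set."""
    if break_glass:
        return True
    return not (batch_running or in_peak_window(today))


class Scheduler:
    """Toy state holder showing pause/resume and a kill switch."""

    def __init__(self) -> None:
        self.state = "running"

    def pause(self) -> None:
        if self.state == "running":
            self.state = "paused"

    def resume(self, smoke_test_passed: bool) -> None:
        # Resume only from a pause, and only after the post-run smoke test passes.
        if self.state == "paused" and smoke_test_passed:
            self.state = "running"

    def kill(self) -> None:
        self.state = "killed"   # terminal; a killed scheduler cannot be resumed

    def may_dispatch(self) -> bool:
        return self.state == "running"
```

Wiring deploy_allowed into the deploy pipeline as a gate, rather than relying on a calendar reminder, is what turns the freeze into tooling.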
Stateful Retries and Atomic Deletes
Chained retries hid context and dropped arguments. Collapse them into a single self-restarting job that preserves its inputs and appends per-attempt logs; retries re-queue to the tail to reduce contention. Deletion should be per-job and atomic: soft-delete with a tombstone and reason, prevent cross-attempt cascades, and allow a short undelete window for operator error.
- Persist attempt count, timestamps, stdout/stderr; backoff with a hard cap
- Keep stable job IDs for idempotency; block orphaned children on delete
- Implement soft-delete with audit trail and a short undelete window
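A rough sketch of the single-job shape, assuming an in-memory deque as the queue. The cap of 5 attempts, the 2 s to 60 s backoff bounds, and the 10-minute undelete window are illustrative values, and Job, run_once, soft_delete, and undelete are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Deque, List, Optional

MAX_ATTEMPTS = 5                          # assumed hard cap
BASE_DELAY_S, MAX_DELAY_S = 2, 60         # assumed backoff bounds
UNDELETE_WINDOW = timedelta(minutes=10)   # assumed "short" window


@dataclass
class Job:
    job_id: str                           # stable across attempts, for idempotency
    args: dict                            # original inputs, never rewritten
    attempts: List[dict] = field(default_factory=list)   # append-only log
    done: bool = False
    deleted_at: Optional[datetime] = None
    delete_reason: str = ""


def backoff_seconds(attempt: int) -> int:
    """Exponential backoff with a hard ceiling."""
    return min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)


def run_once(job: Job, work: Callable[[dict], str], queue: Deque[Job], now: datetime) -> None:
    """Run one attempt; on failure, log it and re-queue the same job at the tail."""
    if job.deleted_at is not None or job.done:
        return
    n = len(job.attempts) + 1
    entry = {"attempt": n, "started": now, "stdout": "", "stderr": ""}
    try:
        entry["stdout"] = work(job.args)
        job.done = True
    except Exception as exc:
        entry["stderr"] = repr(exc)
        if n < MAX_ATTEMPTS:
            entry["retry_in_s"] = backoff_seconds(n)
            queue.append(job)             # same object, same ID, same args
    job.attempts.append(entry)


def soft_delete(job: Job, reason: str, now: datetime) -> None:
    """Tombstone this job only; nothing cascades to other jobs."""
    job.deleted_at, job.delete_reason = now, reason


def undelete(job: Job, now: datetime) -> bool:
    """Restore a tombstoned job if the undelete window has not passed."""
    if job.deleted_at is not None and now - job.deleted_at <= UNDELETE_WINDOW:
        job.deleted_at, job.delete_reason = None, ""
        return True
    return False
```

Because the retry is the same Job object rather than a new child, there are no orphans to clean up on delete and the attempt log stays in one place.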
AI-Assisted QA for Evidence Review
Feeding bundled queue JSON and screenshot URLs to an LLM produced fast pass/fail labels and reconciled the results against the tracking sheet. Use it to clear the long tail while maintaining human oversight. Treat model output as triage, not ground truth, especially where stakes are high.
- Automate export and store prompt, inputs, and outputs with each job
- Have humans spot-check 5–10% of labels and review all ambiguous cases
- Escalate deviations; never treat the model’s label as final truth