Field Notes - Jan 8, '26

Executive Signals

Batch beats bravado: deployment freezes during runs protect the money hour
Stateful beats stateless: retries as one job, logs and idempotency preserve truth
Calendar is capacity planning: capacity mirrors calendar spikes, not comforting long-run averages
Portals before parsing: source status where truth lives, email only when forced
AI for the long tail: triage evidence fast, humans own ambiguous outcomes

Customer Success

Make Upstream Gaps First-Class Issues

Many “failures” are upstream data gaps, not system faults. Classify them as automation-failed with human-readable notes and avoid silent fallbacks. When the portal URL does not match the one recorded in the CRM, fail the job and escalate to the issuer instead of selecting “Other,” which risks contaminating defaults. Fix the data once at the source, then re-run cleanly.

Standardize a fault taxonomy; auto-write CRM notes with the exact blocker
Auto-template issuer escalations with evidence and expected URL; track SLAs
Provide one-click Return to New to re-run clean after data fix
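
A minimal sketch of the fail-and-annotate behavior described above, assuming a small taxonomy and a URL comparison on host and path. The `Fault` members, `Outcome` fields, and `check_portal_url` are hypothetical names for illustration, not the team's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class Fault(Enum):
    # Hypothetical taxonomy entries; the real list would be agreed by the team.
    PORTAL_URL_MISMATCH = "portal_url_mismatch"
    MISSING_CREDENTIALS = "missing_credentials"
    SYSTEM_ERROR = "system_error"


@dataclass
class Outcome:
    status: str              # "ok" or "automation_failed"
    fault: Optional[Fault]   # None when status is "ok"
    crm_note: str            # human-readable note to write to the CRM


def _normalize(url: str) -> str:
    # Assumption: host is case-insensitive and a trailing slash is not significant.
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def check_portal_url(crm_url: str, observed_url: str) -> Outcome:
    """Fail loudly on a mismatch instead of falling back to a default like "Other"."""
    if _normalize(crm_url) == _normalize(observed_url):
        return Outcome("ok", None, "")
    fault = Fault.PORTAL_URL_MISMATCH
    note = (
        f"Automation failed ({fault.value}): observed portal URL "
        f"'{observed_url}' does not match CRM URL '{crm_url}'. "
        "Escalate to issuer; return to New after the record is fixed."
    )
    return Outcome("automation_failed", fault, note)
```

The note names the exact blocker and both URLs, so the same text can seed the issuer escalation.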

Status Tracking: Portal First, Email Last

Pull status from issuer portals wherever available; resort to inbound email parsing only when the issuer mandates it. Standardize on one inbound provider and schema so retries, poison queues, and audits are consistent. Store raw messages and parsed artifacts, and commit to a visible status SLO.

Pick a single inbound provider; define schema, retries, and a poison queue
Target 95% of status updates reflected within 15 minutes of change
Archive raw messages and parsed artifacts for auditability
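
The inbound path and the SLO above could be sketched roughly as follows. `InboundMessage`, `ingest`, `slo_attainment`, and the three-attempt limit are illustrative assumptions; a real pipeline would space retries out and persist both queues.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional, Tuple

SLO_WINDOW = timedelta(minutes=15)   # from the target above
SLO_TARGET = 0.95                    # from the target above
MAX_PARSE_ATTEMPTS = 3               # assumed limit, not from the notes


@dataclass
class InboundMessage:
    message_id: str
    raw: str                                  # kept verbatim for audits
    parsed: Optional[Dict[str, str]] = None   # parsed artifact, stored alongside raw
    attempts: int = 0


def ingest(
    msg: InboundMessage,
    parse: Callable[[str], Dict[str, str]],
    poison_queue: List[InboundMessage],
) -> bool:
    """Try to parse; after MAX_PARSE_ATTEMPTS failures, park the message."""
    while msg.attempts < MAX_PARSE_ATTEMPTS:
        msg.attempts += 1
        try:
            msg.parsed = parse(msg.raw)
            return True
        except ValueError:
            continue
    poison_queue.append(msg)  # raw text survives for manual review
    return False


def slo_attainment(events: List[Tuple[datetime, datetime]]) -> float:
    """Fraction of (changed_at, reflected_at) pairs reflected within SLO_WINDOW."""
    if not events:
        return 1.0
    on_time = sum(1 for changed, reflected in events if reflected - changed <= SLO_WINDOW)
    return on_time / len(events)
```

Comparing `slo_attainment(...)` against `SLO_TARGET` on a dashboard is one way to keep the SLO visible.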

Engineering

Operate the Plant: Batch Windows and Calendar Bursts

Mid-run redeploys multiplied failures; freeze deploys during active batch windows, breaking glass only for genuine infra emergencies. Load spikes concentrate around the 5th of the month, so plan capacity and processes around the calendar, not the average. Provide a scheduler kill switch plus pause/resume so operators can fix issues without compounding failures.

Enforce no-deploy windows around peaks with tooling and alerts
Scale workers 3–5x from the 3rd–7th; watch queue age and tail latency
Add scheduler pause/kill and post-run smoke tests before resuming
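
One way to encode these rules, as a sketch only. The 3rd–7th window comes from the notes; the 4x multiplier is an arbitrary point inside the 3–5x range, and `deploy_allowed`, `target_workers`, and `Scheduler` are hypothetical names.

```python
from datetime import date

PEAK_START_DAY = 3   # from the notes: scale from the 3rd
PEAK_END_DAY = 7     # through the 7th
PEAK_SCALE = 4       # assumed value within the 3-5x range


def in_peak_window(day: date) -> bool:
    return PEAK_START_DAY <= day.day <= PEAK_END_DAY


def deploy_allowed(day: date, batch_running: bool, break_glass: bool = False) -> bool:
    """Block deploys during batch runs and peak days unless break-glass is set."""
    if break_glass:
        return True
    return not (batch_running or in_peak_window(day))


def target_workers(day: date, base: int) -> int:
    return base * PEAK_SCALE if in_peak_window(day) else base


class Scheduler:
    """Pause/resume plus a kill switch; resume is gated on smoke tests."""

    def __init__(self) -> None:
        self.paused = False
        self.killed = False

    def pause(self) -> None:
        self.paused = True

    def kill(self) -> None:
        self.killed = True

    def resume(self, smoke_tests_passed: bool) -> bool:
        # A killed scheduler stays down; a paused one resumes only after smoke tests.
        if self.killed or not smoke_tests_passed:
            return False
        self.paused = False
        return True

    def may_dispatch(self) -> bool:
        return not (self.paused or self.killed)
```

For example, `deploy_allowed(date(2026, 1, 5), batch_running=False)` is `False` because the 5th falls inside the peak window.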

Stateful Retries and Atomic Deletes

Chained retries hid context and dropped arguments. Collapse into a single self-restarting job that preserves inputs and appends per-attempt logs; retries re-queue to the tail to reduce contention. Deletion should be per-job and atomic: soft-delete with a tombstone and reason, prevent cross-attempt cascades, and allow a short undelete window for operator error.

Persist attempt count, timestamps, stdout/stderr; backoff with a hard cap
Keep stable job IDs for idempotency; block orphaned children on delete
Implement soft-delete with audit trail and a short undelete window
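
A rough sketch of the single-job retry and soft-delete ideas above. It retries in-process rather than re-queueing to the tail, and it reads "hard cap" as a cap on both delay and attempt count; both are simplifying assumptions. `Job`, `Attempt`, and the 10-minute undelete window are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

UNDELETE_WINDOW_SECONDS = 600.0  # assumed "short" window


@dataclass
class Attempt:
    number: int
    started_at: float
    ok: bool
    log: str  # stands in for captured stdout/stderr


@dataclass
class Job:
    job_id: str                       # stable across attempts, usable as idempotency key
    args: Dict[str, Any]              # original inputs, never rewritten by retries
    attempts: List[Attempt] = field(default_factory=list)
    deleted_at: Optional[float] = None
    delete_reason: Optional[str] = None


def backoff_seconds(attempt_number: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with a hard cap on the delay."""
    return min(cap, base * 2 ** (attempt_number - 1))


def run_with_retries(
    job: Job,
    work: Callable[..., Any],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `work` with the same args each time, appending one log entry per attempt."""
    while len(job.attempts) < max_attempts:
        number = len(job.attempts) + 1
        started = time.time()
        try:
            result = work(**job.args)
            job.attempts.append(Attempt(number, started, True, str(result)))
            return True
        except Exception as exc:
            job.attempts.append(Attempt(number, started, False, repr(exc)))
            if number < max_attempts:
                sleep(backoff_seconds(number))
    return False


def soft_delete(job: Job, reason: str, now: float) -> None:
    # Tombstone only this job; attempts stay attached for the audit trail.
    job.deleted_at = now
    job.delete_reason = reason


def undelete(job: Job, now: float) -> bool:
    if job.deleted_at is None or now - job.deleted_at > UNDELETE_WINDOW_SECONDS:
        return False
    job.deleted_at = None
    job.delete_reason = None
    return True
```

Because the job object carries its own inputs and attempt history, a retry cannot drop arguments the way a chain of separate jobs can.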

AI-Assisted QA for Evidence Review

Bundling queue JSON and screenshot URLs into an LLM prompt produced fast pass/fail labels and reconciled results with the tracking sheet. Use it to clear the long tail while maintaining human oversight. Treat model output as triage, not ground truth, especially where stakes are high.

Automate export and store prompt, inputs, and outputs with each job
Spot-check 5–10% of labels by hand, plus all ambiguous cases
Escalate deviations; never treat the model’s label as final truth
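
A sketch of the routing logic implied above, with no model call included: the model's text output is passed in as a string. `TriageRecord`, `triage`, and the 10% default rate are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TriageRecord:
    job_id: str
    prompt: str               # stored with the job for later audit
    inputs: Dict[str, Any]    # e.g. queue JSON and screenshot URLs
    model_output: str         # raw model text, kept verbatim
    label: str                # "pass", "fail", or "ambiguous"
    needs_human: bool


def normalize_label(model_output: str) -> str:
    # Anything that is not a clean pass/fail is treated as ambiguous.
    text = model_output.strip().lower()
    return text if text in ("pass", "fail") else "ambiguous"


def triage(
    job_id: str,
    prompt: str,
    inputs: Dict[str, Any],
    model_output: str,
    sheet_status: Optional[str],
    rng: random.Random,
    spot_check_rate: float = 0.10,
) -> TriageRecord:
    """Label is triage only; route to a human on ambiguity, deviation, or sampling."""
    label = normalize_label(model_output)
    deviates = sheet_status is not None and sheet_status != label
    sampled = rng.random() < spot_check_rate
    needs_human = label == "ambiguous" or deviates or sampled
    return TriageRecord(job_id, prompt, inputs, model_output, label, needs_human)
```

Passing a seeded `random.Random` makes the spot-check sample reproducible when a run is audited.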
