
Field Notes - Jan 06, '26
Executive Signals
- Change freezes are the new SLAs: protect hot queues like payments cutovers
- Determinism over discretion: encode policy in code, not ops judgment
- Means mislead, tails decide : test worst-case payloads, not happy paths
- Strings lie, schemas persist : scrape options, map to canonical vocabularies
- Inbox proof beats gut feel: auto-audit confirmations, self-heal mismatches
- One source of truth: CRM roles beat drift-prone docs for CCs
CEO
Machine-Enforced Cutoffs, Not Human Discretion
Late-cycle submissions need deterministic handling to avoid appeals and exceptions. Treat cutoff rules as governance encoded in software, not judgments made under pressure. Statuses and customer notes should reflect the rule, while audited overrides preserve managerial flexibility without eroding trust.
- If created_at > cutoff, set “Rejected: after cutoff” and add a customer-facing note
- Show counts in the daily ops report; track SLA to reschedule next cycle
- Allow role-based override with an audit trail
Product
Post-Submit Auto-Audit Loop
Humans caught mismatches by eyeballing emails and portal records. Automate the check-and-correct loop so errors resolve before customers notice. Compare identities between the confirmation email and the posted portal record, then either self-heal or file evidence.
- Within 5 minutes, fetch confirmation + portal record and compare identity fields
- On mismatch, auto-cancel and resubmit; else attach an evidence pack (screenshots, logs)
- Alert only on mismatch or missing confirmation after N minutes
Resubmission Rules the System Can Enforce
Duplicate approvals and accidental resubmits waste cycles and create audit risk. Encode resubmission as explicit system paths so operators don’t guess under time pressure.
- If has approval_id and dealer/site unchanged, run a “fix-only” path (no submit)
- If dealer/site mismatch, cancel in portal, clear approval_id, then full resubmit
- Drive via explicit fields; “cleared_in_portal=yes” triggers resubmit
Engineering
Freeze Deploys During High-Volume Runs
Hot patches during live batches created queue weirdness, partial runs, and restarts. Treat critical batch windows like payments cutovers: freeze deploys or allow canary-only with aggressive rollback. Idempotency and resumability turn failures into retries, not incidents.
- Gate deploys behind a batch_active lock; drain workers before swapping
- Canary 1 shard → 10% → 100%; auto-rollback on elevated retry/timeout
- Make jobs idempotent with resume tokens to survive restarts
Test the Heavy Tail, Not Just the Mean
Load tests passed while production OOM’d because payload shape, not volume, dominated memory. Design perf suites around worst-case paths: large recipient sets, option-rich pages, and long selector lists.
- Include 95th–99th percentile payloads and max-selector pages in perf tests
- Chunk long lists; cap work per transaction; backoff/retry with jitter
- Set memory requests/limits per step; alert on OOMKill and restart loops
Normalize OEM Portal Labels at the Edge
Portal “compliance team” selectors varied by casing, spacing, and synonyms. Literal strings are brittle. Build an edge adapter that scrapes options, maps to a canonical vocabulary, and fails closed below confidence.
- Trim/lowercase, fuzzy match with synonyms; require ≥0.9 confidence; fail closed
- Cache allowed values per portal daily; log misses and alert on new variants
- Publish the canonical dictionary and mapping audit trail
Customer Success
Make CC Contacts a CRM Role, Not a Google Doc
CC targets in a shared doc guarantee drift and misses. Move to CRM as a structured “Compliance Contact” role so automation pulls from one source of truth and updates propagate instantly.
- Create a single active “Compliance Contact” per account; expose via API
- Webhook on change to refresh worker cache; run nightly backfills until cutover
- Run a two-week dual-run, then deprecate the doc after validation