
Field Notes - Jan 12, '26
Executive Signals
- Failure paths are features: design graceful handoffs before automation masks real onboarding gaps
- Edge cases over happy paths: corpus-driven QA beats per-brand demos and surfaces hidden regressions
- Timeouts aren't causes: classify first failure, route humans only when necessary
- CRM is the black box: write canonical reasons, not noise, on every attempt
- Retries are a budget: spend them on transients, pause on human fixes
Customer Success
Design Onboarding Failure Paths
New accounts without a recipient email will fail predictably. Treat this as a designed path: detect it early, surface a precise status in the system of record, and hand off cleanly to operations. This keeps automation reliable without prematurely building a full onboarding flow.
Mirror the vendor error in the CRM to remove ambiguity, then let ops add the missing recipient and requeue. Clear states and one-click remediation reduce cycle time and noisy alerting.
- Implement a preflight: if recipient is missing, set status to Needs Recipient and skip auto-retries
- Write the vendor error verbatim to CRM notes on every attempt
- Provide a one-click Add recipient → Retry flow; log the resolution reason
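A minimal sketch of the preflight and the Add recipient → Retry handoff, assuming a simple in-memory account record. The Account fields, the Ready status, the note strings, and the function names are hypothetical and stand in for whatever the real CRM exposes; only the Needs Recipient status comes from the notes above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NEEDS_RECIPIENT = "Needs Recipient"  # status named in the notes
READY = "Ready"                      # hypothetical "cleared preflight" status


@dataclass
class Account:
    # Hypothetical record shape; a real CRM object will differ.
    account_id: str
    recipient_email: Optional[str] = None
    status: str = "New"
    auto_retry: bool = True
    crm_notes: List[str] = field(default_factory=list)


def preflight(account: Account) -> bool:
    """Return True if the run may proceed; otherwise park the account for ops."""
    if not (account.recipient_email or "").strip():
        account.status = NEEDS_RECIPIENT
        account.auto_retry = False  # a human fix is needed, so do not burn retries
        account.crm_notes.append("preflight: recipient email missing")
        return False
    account.status = READY
    return True


def add_recipient_and_retry(account: Account, email: str) -> bool:
    """One-click remediation: store the recipient, log why, rerun the preflight."""
    account.recipient_email = email
    account.crm_notes.append("resolved: recipient added by ops")
    account.auto_retry = True
    return preflight(account)
```

Running the preflight before any vendor call means the predictable failure never reaches the retry loop, and the CRM notes record both the block and its resolution.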
Engineering
Edge-Case Corpus Beats Per-Brand Happy Paths
Phase 1 overfit to a single test per vendor and missed edge conditions. For Phase 2, build a test corpus that mixes valid and invalid emails across both portal and email patterns so status checks are fully testable without vendor involvement. Resubmission is harder to test, so isolate it from the rest of the corpus.
Codify the surface area before implementation and hold releases to corpus results, not per-vendor demos. This shifts quality from anecdote to evidence.
- Lock an OEM-by-channel matrix (portal vs email) before implementation
- Create fixtures for malformed, missing, and delayed emails; target at least 80% edge-path coverage
- Release only on corpus pass, not per-vendor “green path” checks
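One way the corpus and release gate could look, as a sketch under stated assumptions. Here edge-path coverage is read as the share of (channel, edge kind) pairs that have at least one fixture, which is only one possible interpretation of the 80% target. The Fixture shape, the status strings other than Needs Recipient, and toy_check_status are hypothetical stand-ins for the real status check.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional

CHANNELS = ("portal", "email")
EDGE_KINDS = ("malformed", "missing", "delayed")


@dataclass(frozen=True)
class Fixture:
    channel: str
    kind: str                 # "valid" or one of EDGE_KINDS
    recipient: Optional[str]
    expected_status: str


# Hypothetical corpus: one fixture per kind on each channel.
CORPUS: List[Fixture] = [
    Fixture(channel, kind, recipient, expected)
    for channel in CHANNELS
    for kind, recipient, expected in [
        ("valid", "ops@example.com", "Delivered"),
        ("malformed", "ops-at-example", "Invalid Recipient"),
        ("missing", None, "Needs Recipient"),
        ("delayed", "slow@example.com", "Pending"),
    ]
]


def edge_coverage(corpus: List[Fixture]) -> float:
    """Share of (channel, edge kind) pairs with at least one fixture."""
    wanted = set(product(CHANNELS, EDGE_KINDS))
    covered = {(f.channel, f.kind) for f in corpus} & wanted
    return len(covered) / len(wanted)


def release_gate(corpus: List[Fixture],
                 check_status: Callable[[Fixture], str],
                 min_edge_coverage: float = 0.8) -> bool:
    """Release only if every fixture passes and edge coverage meets the bar."""
    all_pass = all(check_status(f) == f.expected_status for f in corpus)
    return all_pass and edge_coverage(corpus) >= min_edge_coverage


def toy_check_status(f: Fixture) -> str:
    # Toy stand-in for the system under test, not the real status check.
    if not f.recipient:
        return "Needs Recipient"
    if "@" not in f.recipient:
        return "Invalid Recipient"
    return "Pending" if f.kind == "delayed" else "Delivered"
```

Calling release_gate(CORPUS, toy_check_status) passes only when every fixture matches and the matrix is covered, so a green demo on one vendor cannot clear a release by itself.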
Timeouts Hide Root Cause: Classify Before Retrying
Generic “timed out” notes inflate MTTR and bounce issues between teams. If the run fails while waiting for an email field, the root cause is a missing recipient, not an unknown timeout. Classify on the first failure, write the reason to the CRM, then decide between retry and human action.
Retries should annotate every attempt so history is diagnostic, not decorative. Only transient failures earn another try.
- Map first failure to canonical reason codes (Missing Recipient, Portal Auth, Rate Limit)
- Persist the code plus human-readable detail to the system of record on each attempt
- Auto-retry only for transients; otherwise pause and assign to ops
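The classify-then-retry loop can be sketched as follows. The three reason codes come from the list above; the Unknown fallback, the substring rules, the choice of Rate Limit as the only transient, the retry budget of 3, and the shape of the CRM log entries are all assumptions made for illustration.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional


class Reason(Enum):
    MISSING_RECIPIENT = "Missing Recipient"
    PORTAL_AUTH = "Portal Auth"
    RATE_LIMIT = "Rate Limit"
    UNKNOWN = "Unknown"  # assumed fallback, not in the notes


# Assumption: only rate limiting is treated as transient here.
TRANSIENT = {Reason.RATE_LIMIT}

# Hypothetical substring rules; real vendor errors will need their own mapping.
RULES = [
    ("waiting for email field", Reason.MISSING_RECIPIENT),
    ("401", Reason.PORTAL_AUTH),
    ("429", Reason.RATE_LIMIT),
]


def classify(first_error: str) -> Reason:
    """Map the first failure message to a canonical reason code."""
    text = first_error.lower()
    for needle, reason in RULES:
        if needle in text:
            return reason
    return Reason.UNKNOWN


def run_with_budget(attempt: Callable[[], Optional[str]],
                    crm_log: List[Dict[str, object]],
                    max_retries: int = 3) -> str:
    """attempt() returns None on success or the first error message on failure."""
    for n in range(1, max_retries + 2):  # one initial try plus max_retries
        error = attempt()
        if error is None:
            crm_log.append({"attempt": n, "code": "OK", "detail": ""})
            return "Done"
        reason = classify(error)
        # Persist the code plus the human-readable detail on every attempt.
        crm_log.append({"attempt": n, "code": reason.value, "detail": error})
        if reason not in TRANSIENT:
            return "Paused: assigned to ops"
    return "Paused: retry budget spent"
```

With these rules, a run that fails with "Timed out waiting for email field" is logged as Missing Recipient and paused after one attempt instead of consuming the retry budget.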