diff --git a/docs/plan.mx-exclusion-gaps.md b/docs/plan.mx-exclusion-gaps.md
new file mode 100644
index 0000000..c256238
--- /dev/null
+++ b/docs/plan.mx-exclusion-gaps.md
@@ -0,0 +1,151 @@
+# Plan: close the MX-exclusion gaps in the trucking warmup
+
+**Status:** PROPOSED (2026-06-20). Analysis + design only; no code shipped yet.
+**Owner context:** warmup day 17; big operators (Google/Microsoft/Proofpoint/
+Mimecast/Barracuda/Cisco/Broadcom) are EXCLUDED until day 30, then re-introduced
+via `mx_daily_caps()`. This plan fixes three holes that let throttling/consumer
+MX operators through during that window.
+
+---
+
+## Background: how the two MX layers work today
+
+Sender reputation is judged by the **receiving operator (MX)**, not the recipient
+domain string. There are two independent gates in `scripts/build_trucking_campaigns.py`:
+
+1. **`fetch_carriers()` big-MX EXCLUSION** (SQL `big_mx_exclude`): during warmup
+   (`main_warmup_day() <= MAIN_BIG_MX_EXCLUDE_UNTIL_DAY`, currently day 30) it
+   drops carriers whose `mx_provider IN BIG_MX_OPERATORS`. `mx_provider IS NULL`
+   is deliberately KEPT (so the pool isn't starved before tagging completes).
+2. **`select_sendable_carriers()` per-MX THROTTLE** (`mx_daily_caps` +
+   `per_op` cap): bounds how many of a run's quota go to each KNOWN operator so
+   we never concentrate on one. NULL is NOT capped (would collapse onto one
+   bucket and starve the pool).
+
+`mx_provider` is populated by `scripts/mx_tag_carriers.py`, which resolves each
+domain's MX and returns either a **clean label** (`google`, `microsoft`,
+`proofpoint`, `mimecast`, `cisco`, `barracuda`, `broadcom`, `godaddy`, `zoho`,
+`rackspace`) or, for everything else, an **`mx:` prefix** (e.g.
+`mx:yahoodns.net`, `mx:icloud.com`, `mx:comcast.net`).
+
+---
+
+## The three gaps (with live numbers, 2026-06-20)
+
+### Gap 1: consumer/throttling MX behind the `mx:` prefix are NOT excluded
+`BIG_MX_OPERATORS` only lists the clean labels.
+The big consumer mailbox operators get tagged with the `mx:` prefix and so slip past BOTH gates during warmup:
+
+| mx_provider | sendable carriers | why it's a problem |
+| --- | --- | --- |
+| `mx:yahoodns.net` | **283,113** | Yahoo Small Business / AOL custom domains: same aggressive consumer filtering + complaint-driven blocking as consumer Yahoo. By far the biggest hole. |
+| `mx:icloud.com` | **24,985** | Apple iCloud+ Custom Domain: Apple consumer filtering; iCloud was the biggest consumer leak we already scrubbed from Listmonk. |
+| `mx:comcast.net` | 12,251 | Comcast consumer infra; historically high bounce rates. |
+| `mx:charter.net` | 5,860 | Spectrum/Charter consumer. |
+| `mx:centurylink.net` / `mx:windstream.net` / `mx:tds.net` / `mx:earthlink-vadesecure.net` | ~8,100 | Legacy/satellite ISP consumer mail; many already in `DEAD_ISP_DOMAINS` as literal domains but NOT caught when a custom domain points its MX there. |
+
+`mx:yahoodns.net` alone is **283k** carriers that look "long-tail/safe" to the
+warmup but actually filter like a big operator. This is the headline fix.
+
+> NOTE: the literal-domain layer (`BLOCKED_EMAIL_DOMAINS` incl. the Yahoo family,
+> Apple, dead ISPs) already blocks `someone@yahoo.com` / `@icloud.com`. The hole
+> is a **custom domain whose MX points at Yahoo/iCloud**, which is invisible to
+> the string layer and only visible via MX tagging. That's exactly what this closes.
+
+### Gap 2: 315,892 untagged (NULL) carriers are mailed unvetted
+`mx_provider IS NULL` is kept by both gates by design (anti-starvation). With
+**315,892** sendable NULLs vs 1,187,054 tagged (about 21% of the sendable pool),
+a meaningful slice of every run goes to domains we've never MX-resolved, some of
+which are Google/MS/Yahoo we'd otherwise exclude. This is acceptable as a bootstrap but should shrink over time.
+
+### Gap 3: `mx_tag_carriers.py` is not on a cron
+There is no `infra/cron/pw-mx-tag` (confirmed: no cron references it). So the NULL
+backlog only shrinks when someone runs it by hand.
+New carriers imported by the FMCSA census downloader land as NULL and stay NULL.
+Without continuous tagging, Gaps 1 and 2 slowly re-open.
+
+---
+
+## Proposed fixes
+
+### Fix 1: exclude consumer/throttling `mx:` operators during warmup (HIGH)
+Add an explicit set of `mx:`-prefixed operators that should be treated like the
+big operators during warmup, and fold them into BOTH the exclusion and the
+throttle. Keep it data-driven and documented.
+
+```python
+# scripts/build_trucking_campaigns.py
+# Consumer / aggressively-filtering mailbox operators that mx_tag_carriers.py
+# labels with the "mx:" prefix (no clean label). They throttle/complaint-block
+# like the big operators, so hold them out during warmup too. (yahoodns =
+# Yahoo Small Business + AOL custom domains; icloud = Apple custom domains.)
+CONSUMER_MX_OPERATORS = (
+    "mx:yahoodns.net", "mx:icloud.com", "mx:comcast.net", "mx:charter.net",
+    "mx:centurylink.net", "mx:windstream.net", "mx:tds.net",
+    "mx:earthlink-vadesecure.net",
+)
+# Everything held out of the warmup pool entirely (until MAIN_BIG_MX_EXCLUDE_UNTIL_DAY).
+WARMUP_EXCLUDE_OPERATORS = tuple(BIG_MX_OPERATORS) + CONSUMER_MX_OPERATORS
+```
+- In `fetch_carriers()`: build `big_mx_exclude` from `WARMUP_EXCLUDE_OPERATORS`
+  (not just `BIG_MX_OPERATORS`).
+- In `mx_daily_caps()`: give `CONSUMER_MX_OPERATORS` the same `big` ramp as the
+  clean big operators after day 30 (so they re-introduce gradually, not all at
+  once on day 31).
+- Keep it behind the existing `MAIN_SKIP_BIG_MX` switch so it's reversible.
+
+**Effect:** removes ~330k consumer-MX carriers from the warmup-window pool; the
+long tail of genuinely small/self-hosted systems carries the volume, which is the
+whole point of the warmup strategy.
+
+### Fix 2: bound the NULL bucket with a small cap (MEDIUM)
+Don't exclude NULL (still anti-starvation), but give it a real per-run cap in
+`select_sendable_carriers()` instead of "uncapped". E.g.
+treat unknown/NULL like `__default__` but at a fraction of its cap (say 40/run) so untagged Google/Yahoo
+domains can't flood a run. Pairs with Fix 3 (continuous tagging) to shrink the bucket.
+
+### Fix 3: put `mx_tag_carriers.py` on a daily cron (MEDIUM)
+Add `infra/cron/pw-mx-tag` (model on `pw-listmonk-scrub`) running e.g. 05:45 UTC
+(before the 08:00 trucking builder), tagging the next N thousand NULL domains/day:
+```
+45 5 * * * deploy cd /opt/performancewest && docker compose exec -T workers python3 -m scripts.mx_tag_carriers --limit-domains 20000 >> /var/log/pw-mx-tag.log 2>&1
+```
+Keep the entry on one line: cron does not support backslash line
+continuation, so a wrapped command would not run as intended.
+Install to `/etc/cron.d/` (deploy.sh doesn't run ansible). This continuously
+shrinks the 315k NULL backlog and keeps newly-imported carriers tagged, so Fixes
+1 & 2 stay effective.
+
+---
+
+## Validation plan (verify before/after, no sends triggered)
+
+1. **Dry-run the selector** before/after Fix 1 and diff the per-MX composition of
+   a simulated run (the builder has `list_segments()` / quota selection paths that
+   can be exercised read-only). Assert 0 carriers from `CONSUMER_MX_OPERATORS`
+   are selected while `main_warmup_day() <= 30`.
+2. **SQL sanity:** `SELECT mx_provider, count(*) ... WHERE listmonk_sent_at IS NULL
+   GROUP BY 1` to confirm the excluded operators drop out of the candidate pool.
+3. **Cron (Fix 3):** run `mx_tag_carriers --limit-domains 1000` once by hand,
+   confirm the NULL count falls and no errors; then install the cron and confirm
+   the next-day count fell again (idempotent, bounded).
+4. **Regression:** confirm the long-tail pool is still large enough to hit daily
+   quota at warmup caps (so we don't starve the send). If the long tail is too
+   small after excluding 330k consumer-MX, that's a signal to either lower the
+   daily quota or accept a smaller controlled slice of one consumer operator.
+
+---
+
+## Open questions / decisions for owner
+
+- **Re-introduction after day 30:** treat `CONSUMER_MX_OPERATORS` identically to
+  the big operators (same ramp), or keep Yahoo/iCloud custom domains excluded
+  *longer* (they convert worse and complain more)? Recommendation: same ramp, but
+  watch the reputation monitor's per-operator reject% and pull back if Yahoo
+  spikes.
+- **NULL cap size (Fix 2):** 40/run is a guess; tune against how fast Fix 3 drains
+  the backlog.
+- **Should `mx:` consumer exclusion be permanent (not just warmup)?** For a
+  B2B compliance product, a carrier reachable only at a Yahoo/iCloud custom
+  domain is a low-value, high-complaint segment regardless of warmup. Worth
+  considering a permanent down-weight, not just a warmup hold.
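
To help tune the NULL cap and check the warmup hold without touching the real builder, here is a minimal, self-contained sketch of how the Fix 1 exclusion and the Fix 2 NULL cap might combine in a selector. Everything in it is an assumption for illustration: `select_for_run`, its parameters, and the flat `per_op_cap` are hypothetical stand-ins (the real `select_sendable_carriers()` and the `mx_daily_caps()` ramp are not reproduced), and `BIG_MX_OPERATORS` is reconstructed from the operators named in the plan header, not read from the module.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# ASSUMPTION: reconstructed from the operators named in the plan header;
# the real BIG_MX_OPERATORS in build_trucking_campaigns.py may differ.
BIG_MX_OPERATORS = (
    "google", "microsoft", "proofpoint", "mimecast",
    "barracuda", "cisco", "broadcom",
)
# Copied from the Fix 1 snippet in the plan.
CONSUMER_MX_OPERATORS = (
    "mx:yahoodns.net", "mx:icloud.com", "mx:comcast.net", "mx:charter.net",
    "mx:centurylink.net", "mx:windstream.net", "mx:tds.net",
    "mx:earthlink-vadesecure.net",
)
WARMUP_EXCLUDE_OPERATORS = frozenset(BIG_MX_OPERATORS + CONSUMER_MX_OPERATORS)

EXCLUDE_UNTIL_DAY = 30  # stands in for MAIN_BIG_MX_EXCLUDE_UNTIL_DAY

Carrier = Tuple[int, Optional[str]]  # (carrier_id, mx_provider or None)


def select_for_run(
    carriers: Iterable[Carrier],
    warmup_day: int,
    quota: int,
    per_op_cap: int,
    null_cap: int = 40,  # the plan's placeholder guess for Fix 2
) -> List[Carrier]:
    """Hypothetical selector: warmup hold (Fix 1) plus a NULL cap (Fix 2)."""
    in_warmup = warmup_day <= EXCLUDE_UNTIL_DAY
    used = Counter()  # sends already assigned per operator (None = untagged)
    picked: List[Carrier] = []
    for carrier_id, mx in carriers:
        if len(picked) >= quota:
            break
        # Fix 1: hold out big and consumer operators during warmup.
        if in_warmup and mx in WARMUP_EXCLUDE_OPERATORS:
            continue
        # Fix 2: untagged carriers get their own (smaller) cap.
        cap = null_cap if mx is None else per_op_cap
        if used[mx] >= cap:
            continue
        used[mx] += 1
        picked.append((carrier_id, mx))
    return picked


pool = [
    (1, "google"), (2, "mx:yahoodns.net"), (3, None),
    (4, None), (5, "mx:smallhost.example"), (6, "zoho"),
]
warmup_pick = select_for_run(pool, warmup_day=17, quota=10, per_op_cap=1, null_cap=1)
later_pick = select_for_run(pool, warmup_day=31, quota=10, per_op_cap=1, null_cap=1)
print(Counter(mx for _, mx in warmup_pick))
```

Running the same pool at day 17 and day 31 gives the before/after composition diff described in validation step 1; varying `null_cap` shows how much of a run the untagged bucket could absorb at a given setting.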