new-site/docs/plan.mx-exclusion-gaps.md
justin 285a4a087c docs: plan to close MX-exclusion gaps in trucking warmup
Analysis-only plan (no code shipped). The trucking builder's warmup excludes
big receiving operators (Google/MS/Proofpoint/...) by mx_provider, but three
holes let throttling/consumer MX through during the day<=30 window:

1. Consumer operators tagged with the "mx:" prefix (mx:yahoodns.net = 283,113
   sendable carriers, mx:icloud.com = 24,985, comcast/charter/centurylink/...)
   are NOT in BIG_MX_OPERATORS, so they slip both the exclusion and the throttle.
   These are custom domains whose MX points at Yahoo/iCloud -- invisible to the
   literal-domain blocklist, only catchable via MX tagging. Biggest hole.
2. 315,892 untagged (NULL) sendable carriers receive sends without MX vetting
   (kept by design for anti-starvation, but uncapped).
3. mx_tag_carriers.py is on no cron, so the NULL backlog never drains and new
   FMCSA imports stay untagged -- slowly re-opening gaps 1 and 2.

Plan proposes: CONSUMER_MX_OPERATORS set folded into exclusion+throttle (behind
the existing MAIN_SKIP_BIG_MX switch), a bounded cap on the NULL bucket, and a
daily pw-mx-tag cron. Includes live numbers, validation steps (dry-run selector
diff, no sends), and open decisions (re-introduction ramp, permanent vs warmup-
only exclusion for Yahoo/iCloud custom domains).
2026-06-19 23:55:15 -05:00


# Plan: close the MX-exclusion gaps in the trucking warmup
**Status:** PROPOSED (2026-06-20). Analysis + design only; no code shipped yet.
**Owner context:** warmup day 17; big operators (Google/Microsoft/Proofpoint/
Mimecast/Barracuda/Cisco/Broadcom) are EXCLUDED until day 30, then re-introduced
via `mx_daily_caps()`. This plan fixes three holes that let throttling/consumer
MX operators through during that window.

---
## Background: how the two MX layers work today
Sender reputation is judged by the **receiving operator (MX)**, not the recipient
domain string. There are two independent gates in `scripts/build_trucking_campaigns.py`:
1. **`fetch_carriers()` big-MX EXCLUSION** (SQL `big_mx_exclude`): during warmup
(`main_warmup_day() <= MAIN_BIG_MX_EXCLUDE_UNTIL_DAY`, currently day 30) it
drops carriers whose `mx_provider IN BIG_MX_OPERATORS`. `mx_provider IS NULL`
is deliberately KEPT (so the pool isn't starved before tagging completes).
2. **`select_sendable_carriers()` per-MX THROTTLE** (`mx_daily_caps` +
`per_op` cap): bounds how many of a run's quota go to each KNOWN operator so
we never concentrate on one. NULL is NOT capped (would collapse onto one
bucket and starve the pool).

`mx_provider` is populated by `scripts/mx_tag_carriers.py`, which resolves each
domain's MX and returns either a **clean label** (`google`, `microsoft`,
`proofpoint`, `mimecast`, `cisco`, `barracuda`, `broadcom`, `godaddy`, `zoho`,
`rackspace`) or, for everything else, an **`mx:<root-domain>` prefix** (e.g.
`mx:yahoodns.net`, `mx:icloud.com`, `mx:comcast.net`).
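The labeling rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/mx_tag_carriers.py`: the `CLEAN_LABEL_SUFFIXES` map, the `label_for_mx_host` name, and the naive "last two labels" root extraction are all assumptions (the real script may use a public-suffix list and a different hostname-to-operator mapping). It shows why a custom domain whose MX resolves to Yahoo ends up with an `mx:` tag rather than a clean label.

```python
# Hypothetical sketch of clean-label vs "mx:" tagging; not the real script.

# Assumed mapping from MX hostname suffix to clean operator label.
CLEAN_LABEL_SUFFIXES = {
    "google.com": "google",
    "googlemail.com": "google",
    "outlook.com": "microsoft",
    "pphosted.com": "proofpoint",
    "mimecast.com": "mimecast",
}


def label_for_mx_host(mx_host: str) -> str:
    """Return a clean label for a known operator, else 'mx:<root-domain>'."""
    host = mx_host.rstrip(".").lower()
    for suffix, label in CLEAN_LABEL_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return label
    # Naive root extraction (assumption): keep the last two DNS labels.
    root = ".".join(host.split(".")[-2:])
    return f"mx:{root}"
```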

---
## The three gaps (with live numbers, 2026-06-20)
### Gap 1 — consumer/throttling MX behind the `mx:` prefix are NOT excluded
`BIG_MX_OPERATORS` only lists the clean labels. The big consumer mailbox
operators get tagged with the `mx:` prefix and so slip BOTH gates during warmup:
| mx_provider | sendable carriers | why it's a problem |
| --- | --- | --- |
| `mx:yahoodns.net` | **283,113** | Yahoo Small Business / AOL custom domains — same aggressive consumer filtering + complaint-driven blocking as consumer Yahoo. By far the biggest hole. |
| `mx:icloud.com` | **24,985** | Apple iCloud+ Custom Domain — Apple consumer filtering; iCloud was the biggest consumer leak we already scrubbed from Listmonk. |
| `mx:comcast.net` | 12,251 | Comcast consumer infra; historically bouncy. |
| `mx:charter.net` | 5,860 | Spectrum/Charter consumer. |
| `mx:centurylink.net` / `mx:windstream.net` / `mx:tds.net` / `mx:earthlink-vadesecure.net` | ~8,100 | Legacy/satellite ISP consumer mail; many already in `DEAD_ISP_DOMAINS` as literal domains but NOT caught when a custom domain points its MX there. |

`mx:yahoodns.net` alone is **283k** carriers that look "long-tail/safe" to the
warmup but actually filter like a big operator. This is the headline fix.
> NOTE: the literal-domain layer (`BLOCKED_EMAIL_DOMAINS` incl. the Yahoo family,
> Apple, dead ISPs) already blocks `someone@yahoo.com` / `@icloud.com`. The hole
> is a **custom domain whose MX points at Yahoo/iCloud** — invisible to the
> string layer, only visible via MX tagging. That's exactly what this closes.
### Gap 2 — 315,892 untagged (NULL) carriers receive sends without MX vetting
`mx_provider IS NULL` is kept by both gates by design (anti-starvation). With
**315,892** sendable NULLs vs 1,187,054 tagged (about 21% of the combined
1,502,946), a meaningful slice of every run goes to domains we've never
MX-resolved, some of which are Google/MS/Yahoo that we'd otherwise exclude.
This is acceptable as a bootstrap but should shrink over time.
### Gap 3 — `mx_tag_carriers.py` is not on a cron
There is no `infra/cron/pw-mx-tag` (confirmed: no cron references it). So the NULL
backlog only shrinks when someone runs it by hand. New carriers imported by the
FMCSA census downloader land as NULL and stay NULL. Without continuous tagging,
Gaps 1 and 2 slowly re-open.

---
## Proposed fixes
### Fix 1 — exclude consumer/throttling `mx:` operators during warmup (HIGH)
Add an explicit set of `mx:`-prefixed operators that should be treated like the
big operators during warmup, and fold them into BOTH the exclusion and the
throttle. Keep it data-driven and documented.
```python
# scripts/build_trucking_campaigns.py
# Consumer / aggressively-filtering mailbox operators that mx_tag_carriers.py
# labels with the "mx:" prefix (no clean label). They throttle/complaint-block
# like the big operators, so hold them out during warmup too. (yahoodns =
# Yahoo Small Business + AOL custom domains; icloud = Apple custom domains.)
CONSUMER_MX_OPERATORS = (
    "mx:yahoodns.net", "mx:icloud.com", "mx:comcast.net", "mx:charter.net",
    "mx:centurylink.net", "mx:windstream.net", "mx:tds.net",
    "mx:earthlink-vadesecure.net",
)
# Everything held out of the warmup pool entirely (until MAIN_BIG_MX_EXCLUDE_UNTIL_DAY).
# tuple() keeps the concatenation valid whether BIG_MX_OPERATORS is a tuple, list, or set.
WARMUP_EXCLUDE_OPERATORS = tuple(BIG_MX_OPERATORS) + CONSUMER_MX_OPERATORS
```
- In `fetch_carriers()`: build `big_mx_exclude` from `WARMUP_EXCLUDE_OPERATORS`
(not just `BIG_MX_OPERATORS`).
- In `mx_daily_caps()`: give `CONSUMER_MX_OPERATORS` the same `big` ramp as the
clean big operators after day 30 (so they re-introduce gradually, not all at
once on day 31).
- Keep it behind the existing `MAIN_SKIP_BIG_MX` switch so it's reversible.

**Effect:** removes ~330k consumer-MX carriers from the warmup-window pool; the
long tail of genuinely small/self-hosted systems carries the volume, which is the
whole point of the warmup strategy.
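One detail worth pinning down when building the exclusion from `WARMUP_EXCLUDE_OPERATORS`: in SQL, a bare `mx_provider NOT IN (...)` also drops NULL rows, because the comparison evaluates to NULL rather than true. The sketch below shows a predicate that excludes the listed operators while keeping NULL, as the design requires. It is a self-contained illustration using an in-memory SQLite table; the `carriers` table name, the `warmup_exclude_clause` helper, the abbreviated operator tuples, and the `?` placeholder style are assumptions, and the production query in `fetch_carriers()` may look different.

```python
import sqlite3

# Abbreviated stand-ins for the real constants (assumption for the sketch).
BIG_MX_OPERATORS = ("google", "microsoft", "proofpoint")
CONSUMER_MX_OPERATORS = ("mx:yahoodns.net", "mx:icloud.com")
WARMUP_EXCLUDE_OPERATORS = BIG_MX_OPERATORS + CONSUMER_MX_OPERATORS


def warmup_exclude_clause(operators):
    """Return (sql_fragment, params) that drops the operators but keeps NULL."""
    placeholders = ", ".join("?" for _ in operators)
    sql = f"(mx_provider IS NULL OR mx_provider NOT IN ({placeholders}))"
    return sql, tuple(operators)


# Tiny in-memory table standing in for the carrier pool.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE carriers (id INTEGER PRIMARY KEY, mx_provider TEXT)")
conn.executemany(
    "INSERT INTO carriers (mx_provider) VALUES (?)",
    [("google",), ("mx:yahoodns.net",), ("mx:smallhost.example",), (None,)],
)

clause, params = warmup_exclude_clause(WARMUP_EXCLUDE_OPERATORS)
rows = conn.execute(
    f"SELECT mx_provider FROM carriers WHERE {clause} ORDER BY id", params
)
kept = [row[0] for row in rows]  # long-tail and untagged rows survive
```

Here `kept` contains only the long-tail `mx:smallhost.example` row and the untagged (NULL) row; the Google and Yahoo-hosted rows are filtered out.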
### Fix 2 — bound the NULL bucket with a small cap (MEDIUM)
Don't exclude NULL (still anti-starvation), but give it a real per-run cap in
`select_sendable_carriers()` instead of "uncapped". E.g. treat unknown/NULL like
`__default__` but at a fraction (say 40/run) so an untagged Google/Yahoo domain
can't flood a run. Pairs with Fix 3 (continuous tagging) to shrink the bucket.
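A capped NULL bucket could look roughly like the sketch below. It is not the logic in `select_sendable_carriers()`: the function name `select_with_caps`, the `NULL_BUCKET` key, the `default_cap` parameter (standing in for the `__default__` cap), and the shape of `candidates` are all assumptions made for illustration. The point is only that NULL gets its own counter and its own small limit instead of being exempt.

```python
from collections import Counter

NULL_BUCKET = "__null__"  # hypothetical counter key for untagged carriers


def select_with_caps(candidates, quota, caps, default_cap, null_cap=40):
    """Pick up to `quota` carriers without exceeding any per-operator cap.

    candidates: iterable of (carrier_id, mx_provider) in priority order;
                mx_provider is None for untagged carriers.
    caps:       per-operator caps for known operators.
    """
    taken = Counter()
    selected = []
    for carrier_id, mx_provider in candidates:
        if len(selected) >= quota:
            break
        if mx_provider is None:
            bucket, cap = NULL_BUCKET, null_cap
        else:
            bucket, cap = mx_provider, caps.get(mx_provider, default_cap)
        if taken[bucket] >= cap:
            continue  # this bucket is full; keep scanning for other operators
        taken[bucket] += 1
        selected.append(carrier_id)
    return selected
```

With 100 untagged candidates ahead of 10 long-tail ones, `quota=60`, `default_cap=5`, and `null_cap=40`, this selects 40 NULL carriers plus 5 long-tail carriers and leaves the remaining quota unfilled rather than spending it on more untagged domains.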
### Fix 3 — put `mx_tag_carriers.py` on a daily cron (MEDIUM)
Add `infra/cron/pw-mx-tag` (model on `pw-listmonk-scrub`) running e.g. 05:45 UTC
(before the 08:00 trucking builder), tagging the next N thousand NULL domains/day:
```
# cron does not support backslash line continuation, so the entry is one line.
45 5 * * * deploy cd /opt/performancewest && docker compose exec -T workers python3 -m scripts.mx_tag_carriers --limit-domains 20000 >> /var/log/pw-mx-tag.log 2>&1
```
Install to `/etc/cron.d/` (deploy.sh doesn't run ansible). This continuously
shrinks the 315k NULL backlog and keeps newly-imported carriers tagged, so Fixes
1 & 2 stay effective.

---
## Validation plan (verify before/after, no sends triggered)
1. **Dry-run the selector** before/after Fix 1 and diff the per-MX composition of
a simulated run (the builder has `list_segments()` / quota selection paths that
can be exercised read-only). Assert 0 carriers from `CONSUMER_MX_OPERATORS`
are selected while `main_warmup_day() <= 30`.
2. **SQL sanity:** `SELECT mx_provider, count(*) ... WHERE listmonk_sent_at IS NULL
GROUP BY 1` — confirm the excluded operators drop out of the candidate pool.
3. **Cron (Fix 3):** run `mx_tag_carriers --limit-domains 1000` once by hand,
confirm the NULL count falls and no errors; then install the cron and confirm
the next-day count fell again (idempotent, bounded).
4. **Regression:** confirm the long-tail pool is still large enough to hit daily
quota at warmup caps (so we don't starve the send). If the long tail is too
small after excluding 330k consumer-MX, that's a signal to either lower the
daily quota or accept a smaller controlled slice of one consumer operator.
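The dry-run diff and the zero-leak assertion from step 1 can be sketched as below. This assumes the selector can be made to return `(carrier_id, mx_provider)` pairs without sending anything; the helper names `composition`, `composition_diff`, and `assert_no_consumer_mx` are hypothetical and do not exist in the builder today.

```python
from collections import Counter

CONSUMER_MX_OPERATORS = (
    "mx:yahoodns.net", "mx:icloud.com", "mx:comcast.net", "mx:charter.net",
    "mx:centurylink.net", "mx:windstream.net", "mx:tds.net",
    "mx:earthlink-vadesecure.net",
)


def composition(selected):
    """Count selected carriers per mx_provider; None is reported as 'NULL'."""
    return Counter("NULL" if mx is None else mx for _, mx in selected)


def composition_diff(before, after):
    """Return {operator: (before_count, after_count)} where the count changed."""
    operators = set(before) | set(after)
    return {
        op: (before[op], after[op])
        for op in sorted(operators)
        if before[op] != after[op]
    }


def assert_no_consumer_mx(selected, warmup_day, exclude_until_day=30):
    """Fail if any consumer-MX carrier is selected during the warmup window."""
    if warmup_day > exclude_until_day:
        return  # past the exclusion window; the ramp applies instead
    leaked = [cid for cid, mx in selected if mx in CONSUMER_MX_OPERATORS]
    assert not leaked, f"{len(leaked)} consumer-MX carriers selected during warmup"
```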

---
## Open questions / decisions for owner
- **Re-introduction after day 30:** treat `CONSUMER_MX_OPERATORS` identically to
the big operators (same ramp), or keep Yahoo/iCloud custom domains excluded
*longer* (they convert worse and complain more)? Recommendation: same ramp, but
watch the reputation monitor's per-operator reject% and pull back if Yahoo
spikes.
- **NULL cap size (Fix 2):** 40/run is a guess; tune against how fast Fix 3 drains
the backlog.
- **Should `mx:` consumer exclusion be permanent (not just warmup)?** For a
B2B compliance product, a carrier reachable only at a Yahoo/iCloud custom
domain is a low-value, high-complaint segment regardless of warmup. Worth
considering a permanent down-weight, not just a warmup hold.