Analysis-only plan (no code shipped). The trucking builder's warmup excludes big receiving operators (Google/MS/Proofpoint/...) by mx_provider, but three holes let throttling/consumer MX through during the day<=30 window:

1. Consumer operators tagged with the "mx:" prefix (mx:yahoodns.net = 283,113 sendable carriers, mx:icloud.com = 24,985, comcast/charter/centurylink/...) are NOT in BIG_MX_OPERATORS, so they slip both the exclusion and the throttle. These are custom domains whose MX points at Yahoo/iCloud: invisible to the literal-domain blocklist, only catchable via MX tagging. Biggest hole.
2. 315,892 untagged (NULL) sendable carriers are mailed without MX vetting (kept by design for anti-starvation, but uncapped).
3. mx_tag_carriers.py is on no cron, so the NULL backlog never drains and new FMCSA imports stay untagged, slowly re-opening gaps 1 and 2.

Plan proposes: a CONSUMER_MX_OPERATORS set folded into exclusion+throttle (behind the existing MAIN_SKIP_BIG_MX switch), a bounded cap on the NULL bucket, and a daily pw-mx-tag cron. Includes live numbers, validation steps (dry-run selector diff, no sends), and open decisions (re-introduction ramp, permanent vs warmup-only exclusion for Yahoo/iCloud custom domains).

# Plan: close the MX-exclusion gaps in the trucking warmup

**Status:** PROPOSED (2026-06-20). Analysis + design only; no code shipped yet.
**Owner context:** warmup day 17; big operators (Google/Microsoft/Proofpoint/
Mimecast/Barracuda/Cisco/Broadcom) are EXCLUDED until day 30, then re-introduced
via `mx_daily_caps()`. This plan fixes three holes that let throttling/consumer
MX operators through during that window.

---

## Background: how the two MX layers work today

Sender reputation is judged by the **receiving operator (MX)**, not the recipient
domain string. There are two independent gates in `scripts/build_trucking_campaigns.py`:

1. **`fetch_carriers()` big-MX EXCLUSION** (SQL `big_mx_exclude`): during warmup
   (`main_warmup_day() <= MAIN_BIG_MX_EXCLUDE_UNTIL_DAY`, currently day 30) it
   drops carriers whose `mx_provider IN BIG_MX_OPERATORS`. `mx_provider IS NULL`
   is deliberately KEPT (so the pool isn't starved before tagging completes).
2. **`select_sendable_carriers()` per-MX THROTTLE** (`mx_daily_caps` +
   `per_op` cap): bounds how many of a run's quota go to each KNOWN operator so
   we never concentrate on one. NULL is NOT capped (a cap would collapse every
   untagged carrier into one bucket and starve the pool).

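The two gates above can be sketched in a few lines. This is only an illustration of the described behaviour, not the builder's code: `passes_exclusion`, `apply_throttle`, the sample pool, and the contents of `BIG_MX_OPERATORS` shown here are assumptions.

```python
from collections import Counter

# Hypothetical stand-ins; the real values live in build_trucking_campaigns.py.
BIG_MX_OPERATORS = ("google", "microsoft", "proofpoint", "mimecast",
                    "barracuda", "cisco", "broadcom")
MAIN_BIG_MX_EXCLUDE_UNTIL_DAY = 30


def passes_exclusion(mx_provider, warmup_day):
    """Gate 1: drop big operators during warmup; NULL (None) is kept."""
    if warmup_day > MAIN_BIG_MX_EXCLUDE_UNTIL_DAY:
        return True
    return mx_provider is None or mx_provider not in BIG_MX_OPERATORS


def apply_throttle(carriers, per_op_cap, quota):
    """Gate 2: cap each KNOWN operator; NULL is left uncapped."""
    taken = Counter()
    selected = []
    for carrier in carriers:
        if len(selected) >= quota:
            break
        op = carrier["mx_provider"]
        if op is not None:
            if taken[op] >= per_op_cap:
                continue  # this operator already used its share of the run
            taken[op] += 1
        selected.append(carrier)
    return selected


pool = (
    [{"id": i, "mx_provider": "google"} for i in range(3)]
    + [{"id": 10 + i, "mx_provider": "zoho"} for i in range(5)]
    + [{"id": 20 + i, "mx_provider": None} for i in range(4)]
)
kept = [c for c in pool if passes_exclusion(c["mx_provider"], warmup_day=17)]
chosen = apply_throttle(kept, per_op_cap=2, quota=10)
composition = Counter(c["mx_provider"] for c in chosen)
```

On warmup day 17 the three `google` rows are dropped by gate 1, `zoho` is held to two by gate 2, and all four NULL rows pass both gates untouched.
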
`mx_provider` is populated by `scripts/mx_tag_carriers.py`, which resolves each
domain's MX and returns either a **clean label** (`google`, `microsoft`,
`proofpoint`, `mimecast`, `cisco`, `barracuda`, `broadcom`, `godaddy`, `zoho`,
`rackspace`) or, for everything else, an **`mx:<root-domain>` prefix** (e.g.
`mx:yahoodns.net`, `mx:icloud.com`, `mx:comcast.net`).

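A minimal sketch of that labelling rule, assuming the tagger keys on the MX host's root domain. The suffix table, `root_domain`, and `label_for_mx_host` are hypothetical; the real mapping inside `mx_tag_carriers.py` is not shown in this plan. No DNS lookup is performed here, the MX host is passed in directly.

```python
# Hypothetical label table: MX root domain -> clean label.
CLEAN_LABEL_ROOTS = {
    "google.com": "google",
    "googlemail.com": "google",
    "outlook.com": "microsoft",
    "pphosted.com": "proofpoint",
    "mimecast.com": "mimecast",
}


def root_domain(host):
    """Naive root: last two DNS labels (ignores multi-part suffixes like co.uk)."""
    labels = host.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def label_for_mx_host(mx_host):
    """Return a clean label when the MX root is known, else 'mx:<root-domain>'."""
    if not mx_host:
        return None  # unresolved: mx_provider stays NULL
    root = root_domain(mx_host)
    return CLEAN_LABEL_ROOTS.get(root, f"mx:{root}")


examples = {
    host: label_for_mx_host(host)
    for host in ("aspmx.l.google.com", "mta5.am0.yahoodns.net.", "mx01.mail.icloud.com")
}
```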

---

## The three gaps (with live numbers, 2026-06-20)

### Gap 1: consumer/throttling MX behind the `mx:` prefix are NOT excluded
`BIG_MX_OPERATORS` only lists the clean labels. The big consumer mailbox
operators get tagged with the `mx:` prefix and so slip BOTH gates during warmup:

| mx_provider | sendable carriers | why it's a problem |
| --- | --- | --- |
| `mx:yahoodns.net` | **283,113** | Yahoo Small Business / AOL custom domains: same aggressive consumer filtering and complaint-driven blocking as consumer Yahoo. By far the biggest hole. |
| `mx:icloud.com` | **24,985** | Apple iCloud+ Custom Domain: Apple consumer filtering; iCloud was the biggest consumer leak we already scrubbed from Listmonk. |
| `mx:comcast.net` | 12,251 | Comcast consumer infra; historically bouncy. |
| `mx:charter.net` | 5,860 | Spectrum/Charter consumer. |
| `mx:centurylink.net` / `mx:windstream.net` / `mx:tds.net` / `mx:earthlink-vadesecure.net` | ~8,100 | Legacy/satellite ISP consumer mail; many already in `DEAD_ISP_DOMAINS` as literal domains but NOT caught when a custom domain points its MX there. |

`mx:yahoodns.net` alone is **283k** carriers that look "long-tail/safe" to the
warmup but actually filter like a big operator. This is the headline fix.

> NOTE: the literal-domain layer (`BLOCKED_EMAIL_DOMAINS` incl. the Yahoo family,
> Apple, dead ISPs) already blocks `someone@yahoo.com` / `@icloud.com`. The hole
> is a **custom domain whose MX points at Yahoo/iCloud**, which is invisible to
> the string layer and only visible via MX tagging. That's exactly what this closes.

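The difference between the two layers in the note can be shown with a toy check. All data here is made up for illustration: the `.example` domains, the `MX_TAGS` lookup, and both helper functions are hypothetical, and `BLOCKED_EMAIL_DOMAINS` is reduced to a short excerpt.

```python
# Hypothetical excerpt; the real BLOCKED_EMAIL_DOMAINS is longer.
BLOCKED_EMAIL_DOMAINS = {"yahoo.com", "aol.com", "icloud.com", "me.com"}

# Hypothetical MX tags, shaped like the values mx_tag_carriers.py stores.
MX_TAGS = {
    "acme-hauling.example": "mx:yahoodns.net",
    "smithfreight.example": "mx:icloud.com",
    "selfhosted.example": "mx:selfhosted.example",
}
CONSUMER_MX = {"mx:yahoodns.net", "mx:icloud.com"}


def email_domain(email):
    return email.rsplit("@", 1)[-1].lower()


def blocked_by_string_layer(email):
    """Literal-domain check: only sees the text after the @."""
    return email_domain(email) in BLOCKED_EMAIL_DOMAINS


def blocked_by_mx_layer(email):
    """MX check: sees which operator actually receives the mail."""
    return MX_TAGS.get(email_domain(email)) in CONSUMER_MX


for address in ("owner@yahoo.com", "dispatch@acme-hauling.example", "ops@selfhosted.example"):
    print(address, blocked_by_string_layer(address), blocked_by_mx_layer(address))
```

`dispatch@acme-hauling.example` passes the string layer but is caught by the MX layer, which is the case Gap 1 describes.
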
### Gap 2: 315,892 untagged (NULL) carriers are mailed without MX vetting
`mx_provider IS NULL` is kept by both gates by design (anti-starvation). With
**315,892** sendable NULLs vs 1,187,054 tagged (about 21% of the combined total),
a meaningful slice of every run goes to domains we've never MX-resolved, some of
which are Google/MS/Yahoo domains we'd otherwise exclude. This is acceptable as a
bootstrap but should shrink over time.

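As a quick check on the size of the NULL bucket, the two counts quoted above give its share of the combined pool. This assumes the "tagged" figure also counts only sendable carriers, which the plan does not state explicitly.

```python
null_sendable = 315_892      # mx_provider IS NULL (figure from this plan)
tagged_sendable = 1_187_054  # mx_provider IS NOT NULL (figure from this plan)

null_share = null_sendable / (null_sendable + tagged_sendable)
print(f"NULL share of sendable pool: {null_share:.1%}")
```

During the warmup window the effective share is higher than this, because the excluded big operators come out of the tagged side only.
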
### Gap 3: `mx_tag_carriers.py` is not on a cron
There is no `infra/cron/pw-mx-tag` (confirmed: no cron references it). So the NULL
backlog only shrinks when someone runs it by hand. New carriers imported by the
FMCSA census downloader land as NULL and stay NULL. Without continuous tagging,
Gaps 1 and 2 slowly re-open.

---

## Proposed fixes

### Fix 1: exclude consumer/throttling `mx:` operators during warmup (HIGH)
Add an explicit set of `mx:`-prefixed operators that should be treated like the
big operators during warmup, and fold them into BOTH the exclusion and the
throttle. Keep it data-driven and documented.

```python
# scripts/build_trucking_campaigns.py
# Consumer / aggressively-filtering mailbox operators that mx_tag_carriers.py
# labels with the "mx:" prefix (no clean label). They throttle/complaint-block
# like the big operators, so hold them out during warmup too. (yahoodns =
# Yahoo Small Business + AOL custom domains; icloud = Apple custom domains.)
CONSUMER_MX_OPERATORS = (
    "mx:yahoodns.net", "mx:icloud.com", "mx:comcast.net", "mx:charter.net",
    "mx:centurylink.net", "mx:windstream.net", "mx:tds.net",
    "mx:earthlink-vadesecure.net",
)
# Everything held out of the warmup pool entirely (until MAIN_BIG_MX_EXCLUDE_UNTIL_DAY).
# BIG_MX_OPERATORS is defined earlier in this module; tuple() keeps the
# concatenation valid whether it is declared as a tuple, list, or set.
WARMUP_EXCLUDE_OPERATORS = tuple(BIG_MX_OPERATORS) + CONSUMER_MX_OPERATORS
```
- In `fetch_carriers()`: build `big_mx_exclude` from `WARMUP_EXCLUDE_OPERATORS`
  (not just `BIG_MX_OPERATORS`).
- In `mx_daily_caps()`: give `CONSUMER_MX_OPERATORS` the same `big` ramp as the
  clean big operators after day 30 (so they re-introduce gradually, not all at
  once on day 31).
- Keep it behind the existing `MAIN_SKIP_BIG_MX` switch so it's reversible.

**Effect:** removes ~330k consumer-MX carriers (the Gap 1 rows sum to roughly
334k) from the warmup-window pool; the long tail of genuinely small/self-hosted
systems carries the volume, which is the whole point of the warmup strategy.

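One way the `fetch_carriers()` change could be wired is sketched below. It is an assumption-heavy illustration: `big_mx_exclude_clause` is a hypothetical helper, the operator tuples are shortened stand-ins, and the `%s` placeholder style assumes a psycopg-like driver, which this plan does not confirm.

```python
# Hypothetical stand-ins for names defined in build_trucking_campaigns.py.
BIG_MX_OPERATORS = ("google", "microsoft", "proofpoint")
CONSUMER_MX_OPERATORS = ("mx:yahoodns.net", "mx:icloud.com")
WARMUP_EXCLUDE_OPERATORS = tuple(BIG_MX_OPERATORS) + CONSUMER_MX_OPERATORS
MAIN_BIG_MX_EXCLUDE_UNTIL_DAY = 30


def big_mx_exclude_clause(warmup_day, skip_big_mx=True):
    """Return (sql_fragment, params) for the warmup exclusion.

    NULL is kept explicitly: NOT IN alone would drop NULL rows, because
    NULL NOT IN (...) evaluates to NULL, which WHERE treats as false.
    """
    if not skip_big_mx or warmup_day > MAIN_BIG_MX_EXCLUDE_UNTIL_DAY:
        return "", ()
    placeholders = ", ".join(["%s"] * len(WARMUP_EXCLUDE_OPERATORS))
    sql = f"AND (mx_provider IS NULL OR mx_provider NOT IN ({placeholders}))"
    return sql, WARMUP_EXCLUDE_OPERATORS


sql, params = big_mx_exclude_clause(warmup_day=17)
print(sql)
print(params)
```

Passing the operators as bound parameters rather than formatting them into the SQL string keeps the list data-driven, and turning the switch off (`skip_big_mx=False`) returns an empty fragment so the change stays reversible.
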
### Fix 2: bound the NULL bucket with a small cap (MEDIUM)
Don't exclude NULL (still anti-starvation), but give it a real per-run cap in
`select_sendable_carriers()` instead of "uncapped". E.g. treat unknown/NULL like
`__default__` but at a fraction (say 40/run) so an untagged Google/Yahoo domain
can't flood a run. Pairs with Fix 3 (continuous tagging) to shrink the bucket.

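A rough sketch of a capped NULL bucket, under stated assumptions: `select_with_null_cap`, `NULL_BUCKET`, and the sample pool are hypothetical, and the real `select_sendable_carriers()` may order and cap carriers differently. The value 40 is the plan's starting guess.

```python
from collections import Counter

NULL_BUCKET = "__null__"   # hypothetical bucket name for untagged carriers
NULL_CAP_PER_RUN = 40      # the plan's starting guess; tune later


def select_with_null_cap(carriers, caps, quota):
    """Pick up to `quota` carriers, capping every bucket including NULL."""
    taken = Counter()
    selected = []
    for carrier in carriers:
        if len(selected) >= quota:
            break
        bucket = carrier["mx_provider"] or NULL_BUCKET
        if bucket == NULL_BUCKET:
            cap = NULL_CAP_PER_RUN
        else:
            cap = caps.get(bucket, caps["__default__"])
        if taken[bucket] >= cap:
            continue  # bucket is full for this run
        taken[bucket] += 1
        selected.append(carrier)
    return selected


# 100 untagged carriers plus 200 long-tail carriers on distinct MX hosts.
pool = [{"id": i, "mx_provider": None} for i in range(100)]
pool += [{"id": 100 + i, "mx_provider": f"mx:host{i}.example"} for i in range(200)]
chosen = select_with_null_cap(pool, caps={"__default__": 5}, quota=150)
null_count = sum(1 for c in chosen if c["mx_provider"] is None)
```

With this pool the run still fills its quota of 150, but only 40 of those come from the untagged bucket; the remainder is drawn from the tagged long tail.
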
### Fix 3: put `mx_tag_carriers.py` on a daily cron (MEDIUM)
Add `infra/cron/pw-mx-tag` (model on `pw-listmonk-scrub`) running e.g. 05:45 UTC
(before the 08:00 trucking builder), tagging the next N thousand NULL domains/day:
```
# cron does not support backslash line continuations: keep the entry on one line.
45 5 * * * deploy cd /opt/performancewest && docker compose exec -T workers python3 -m scripts.mx_tag_carriers --limit-domains 20000 >> /var/log/pw-mx-tag.log 2>&1
```
Install it to `/etc/cron.d/` directly (`deploy.sh` doesn't run ansible). The
schedule is read in the host's local time, so 05:45 UTC assumes the host clock is
set to UTC. This continuously shrinks the 315k NULL backlog and keeps
newly-imported carriers tagged, so Fixes 1 & 2 stay effective.

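For a sense of how long the backlog takes to drain at the proposed limit, a back-of-the-envelope estimate follows. It assumes one domain per untagged carrier, no new NULL imports, and that the tagger spends its whole daily limit on this backlog; none of these is guaranteed, so treat the result as a rough figure.

```python
import math

null_backlog = 315_892    # untagged sendable carriers (figure from this plan)
domains_per_day = 20_000  # --limit-domains in the proposed cron entry

days_to_drain = math.ceil(null_backlog / domains_per_day)
print(f"Estimated days to drain the NULL backlog: {days_to_drain}")
```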

---

## Validation plan (verify before/after, no sends triggered)

1. **Dry-run the selector** before/after Fix 1 and diff the per-MX composition of
   a simulated run (the builder has `list_segments()` / quota selection paths that
   can be exercised read-only). Assert 0 carriers from `CONSUMER_MX_OPERATORS`
   are selected while `main_warmup_day() <= 30`.
2. **SQL sanity:** `SELECT mx_provider, count(*) ... WHERE listmonk_sent_at IS NULL
   GROUP BY 1`, then confirm the excluded operators drop out of the candidate pool.
3. **Cron (Fix 3):** run `mx_tag_carriers --limit-domains 1000` once by hand,
   confirm the NULL count falls and no errors; then install the cron and confirm
   the next-day count fell again (idempotent, bounded).
4. **Regression:** confirm the long-tail pool is still large enough to hit daily
   quota at warmup caps (so we don't starve the send). If the long tail is too
   small after excluding 330k consumer-MX, that's a signal to either lower the
   daily quota or accept a smaller controlled slice of one consumer operator.

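The dry-run diff in step 1 could be expressed roughly as follows. This is a sketch on hand-made sample runs, not a harness for the real builder: `composition`, `diff_composition`, `assert_no_consumer_mx`, and the sample data are all hypothetical, and `CONSUMER_MX_OPERATORS` is shortened to an excerpt.

```python
from collections import Counter

CONSUMER_MX_OPERATORS = ("mx:yahoodns.net", "mx:icloud.com")  # excerpt only


def composition(selected):
    """Per-MX counts for one simulated run; NULL is reported as '(null)'."""
    return Counter(c["mx_provider"] or "(null)" for c in selected)


def diff_composition(before, after):
    """Operators whose selected count changed, as {operator: (before, after)}."""
    ops = set(before) | set(after)
    return {op: (before[op], after[op]) for op in sorted(ops) if before[op] != after[op]}


def assert_no_consumer_mx(selected, warmup_day, exclude_until_day=30):
    """Fail loudly if any consumer-MX carrier is selected during warmup."""
    if warmup_day > exclude_until_day:
        return
    leaked = [c for c in selected if c["mx_provider"] in CONSUMER_MX_OPERATORS]
    assert not leaked, f"{len(leaked)} consumer-MX carriers selected on day {warmup_day}"


def run(*providers):
    return [{"mx_provider": p} for p in providers]


before_run = run("mx:yahoodns.net", "mx:yahoodns.net", "mx:icloud.com", "zoho", None)
after_run = run("zoho", None, "mx:smallhost.example")

delta = diff_composition(composition(before_run), composition(after_run))
assert_no_consumer_mx(after_run, warmup_day=17)
```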

---

## Open questions / decisions for owner

- **Re-introduction after day 30:** treat `CONSUMER_MX_OPERATORS` identically to
  the big operators (same ramp), or keep Yahoo/iCloud custom domains excluded
  *longer* (they convert worse and complain more)? Recommendation: same ramp, but
  watch the reputation monitor's per-operator reject% and pull back if Yahoo
  spikes.
- **NULL cap size (Fix 2):** 40/run is a guess; tune against how fast Fix 3 drains
  the backlog.
- **Should `mx:` consumer exclusion be permanent (not just warmup)?** For a
  B2B compliance product, a carrier reachable only at a Yahoo/iCloud custom
  domain is a low-value, high-complaint segment regardless of warmup. Worth
  considering a permanent down-weight, not just a warmup hold.