new-site/docs/plan.mx-exclusion-gaps.md
justin 4f52d12629 docs: mark MX-exclusion plan complete (all 3 fixes shipped)
Fix 2 (untagged NULL bucket cap) shipped in bc93d93; default is no-starve.
Plan fully implemented.
2026-06-20 00:21:52 -05:00


Plan: close the MX-exclusion gaps in the trucking warmup

Status: ALL THREE FIXES SHIPPED 2026-06-20 (Fix 1+3 9eeed47, Fix 2 bc93d93).

Owner context: warmup day 17; big operators (Google/Microsoft/Proofpoint/Mimecast/Barracuda/Cisco/Broadcom) are EXCLUDED until day 30, then re-introduced via mx_daily_caps(). This plan fixes three holes that let throttling/consumer MX operators through during that window.

What shipped (2026-06-20, commits 9eeed47 and bc93d93)

  • Fix 1 (DONE): CONSUMER_MX_OPERATORS (mx:yahoodns.net, mx:icloud.com, comcast/charter/centurylink/windstream/tds/earthlink) folded into WARMUP_EXCLUDE_OPERATORS, used by both the fetch_carriers() exclusion SQL and mx_daily_caps() (same day-30 ramp). Verified live: warmup-eligible pool = 353,909 carriers after the fix (not starved), and mx_daily_caps() returns cap 0 for mx:yahoodns.net during warmup.
  • Fix 3 (DONE): infra/cron/pw-mx-tag installed to /etc/cron.d/ (05:45 UTC daily, --only-unsent --limit-domains 20000). Verified: a 200-domain test run tagged 216 domains; idempotent/bounded.
  • Fix 2 (DONE): select_sendable_carriers() now bounds the untagged (NULL mx_provider) bucket with a single shared untagged_cap (env MAIN_UNTAGGED_MX_CAP, default max(quota, 200) = no-starve / no behavior change today). Only ~3,035 distinct verified-sendable untagged domains remain, so pw-mx-tag drains them in its first run; tighten the cap to a fraction of quota afterward to prefer the tagged long tail. Commit bc93d93.

Background: how the two MX layers work today

Sender reputation is judged by the receiving operator (MX), not the recipient domain string. There are two independent gates in scripts/build_trucking_campaigns.py:

  1. fetch_carriers() big-MX EXCLUSION (SQL big_mx_exclude): during warmup (main_warmup_day() <= MAIN_BIG_MX_EXCLUDE_UNTIL_DAY, currently day 30) it drops carriers whose mx_provider IN BIG_MX_OPERATORS. mx_provider IS NULL is deliberately KEPT (so the pool isn't starved before tagging completes).
  2. select_sendable_carriers() per-MX THROTTLE (mx_daily_caps + per_op cap): bounds how many of a run's quota go to each KNOWN operator so we never concentrate on one. NULL is NOT capped (would collapse onto one bucket and starve the pool).
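The first gate can be sketched as a small SQL-fragment builder. This is a minimal illustration, not the actual code in build_trucking_campaigns.py: the function name big_mx_exclude_clause, its parameters, and the %s placeholder style are assumptions. It mainly shows why NULL has to be kept explicitly: in SQL, a NULL mx_provider compared with NOT IN yields NULL rather than TRUE, so untagged rows would otherwise be dropped.

```python
from typing import Sequence


def big_mx_exclude_clause(
    warmup_day: int,
    exclude_until_day: int,
    operators: Sequence[str],
) -> tuple[str, list[str]]:
    """Return (sql_fragment, params) for the warmup exclusion gate (hypothetical helper)."""
    # Outside the warmup window, or with nothing to exclude, the gate is a no-op.
    if warmup_day > exclude_until_day or not operators:
        return "TRUE", []
    placeholders = ", ".join(["%s"] * len(operators))
    # `NULL NOT IN (...)` evaluates to NULL, not TRUE, so untagged rows need an
    # explicit IS NULL branch to stay in the pool.
    clause = f"(mx_provider IS NULL OR mx_provider NOT IN ({placeholders}))"
    return clause, list(operators)


clause, params = big_mx_exclude_clause(17, 30, ("google", "microsoft"))
print(clause)  # (mx_provider IS NULL OR mx_provider NOT IN (%s, %s))
print(params)  # ['google', 'microsoft']
```

Passing the operator names as bound parameters rather than interpolating them keeps the fragment safe to append to a larger WHERE clause.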

mx_provider is populated by scripts/mx_tag_carriers.py, which resolves each domain's MX and returns either a clean label (google, microsoft, proofpoint, mimecast, cisco, barracuda, broadcom, godaddy, zoho, rackspace) or, for everything else, an mx:<root-domain> prefix (e.g. mx:yahoodns.net, mx:icloud.com, mx:comcast.net).
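The labelling rule described above could look roughly like the following. This is a sketch under assumptions, not the logic of mx_tag_carriers.py: the suffix table and the function name label_mx_host are hypothetical, and the root domain is approximated by the last two DNS labels, which would be wrong for multi-part suffixes such as .co.uk.

```python
# Hypothetical suffix table: MX host suffix -> clean operator label.
CLEAN_LABEL_SUFFIXES = {
    "google.com": "google",
    "googlemail.com": "google",
    "outlook.com": "microsoft",
    "pphosted.com": "proofpoint",
    "mimecast.com": "mimecast",
    "iphmx.com": "cisco",
    "barracudanetworks.com": "barracuda",
    "messagelabs.com": "broadcom",
    "secureserver.net": "godaddy",
    "zoho.com": "zoho",
    "emailsrvr.com": "rackspace",
}


def label_mx_host(mx_host: str) -> str:
    """Map one resolved MX hostname to a clean label or an mx:<root-domain> tag."""
    host = mx_host.strip().rstrip(".").lower()
    for suffix, label in CLEAN_LABEL_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return label
    # Naive root-domain guess: keep the last two labels of the hostname.
    root = ".".join(host.split(".")[-2:])
    return f"mx:{root}"


print(label_mx_host("aspmx.l.google.com."))    # google
print(label_mx_host("mta5.am0.yahoodns.net"))  # mx:yahoodns.net
print(label_mx_host("mx01.mail.icloud.com"))   # mx:icloud.com
```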


The three gaps (with live numbers, 2026-06-20)

Gap 1 — consumer/throttling MX behind the mx: prefix are NOT excluded

BIG_MX_OPERATORS only lists the clean labels. The big consumer mailbox operators get tagged with the mx: prefix and so slip BOTH gates during warmup:

| mx_provider | sendable carriers | why it's a problem |
| --- | --- | --- |
| mx:yahoodns.net | 283,113 | Yahoo Small Business / AOL custom domains: same aggressive consumer filtering + complaint-driven blocking as consumer Yahoo. By far the biggest hole. |
| mx:icloud.com | 24,985 | Apple iCloud+ Custom Domain: Apple consumer filtering; iCloud was the biggest consumer leak we already scrubbed from Listmonk. |
| mx:comcast.net | 12,251 | Comcast consumer infra; historically bouncy. |
| mx:charter.net | 5,860 | Spectrum/Charter consumer. |
| mx:centurylink.net / mx:windstream.net / mx:tds.net / mx:earthlink-vadesecure.net | ~8,100 | Legacy/satellite ISP consumer mail; many already in DEAD_ISP_DOMAINS as literal domains but NOT caught when a custom domain points its MX there. |

mx:yahoodns.net alone is 283k carriers that look "long-tail/safe" to the warmup but actually filter like a big operator. This is the headline fix.

NOTE: the literal-domain layer (BLOCKED_EMAIL_DOMAINS incl. the Yahoo family, Apple, dead ISPs) already blocks someone@yahoo.com / @icloud.com. The hole is a custom domain whose MX points at Yahoo/iCloud — invisible to the string layer, only visible via MX tagging. That's exactly what this closes.

Gap 2 — 315,892 untagged (NULL) carriers are sent to unvetted

mx_provider IS NULL is kept by both gates by design (anti-starvation). With 315,892 sendable NULLs vs 1,187,054 tagged, a meaningful slice of every run goes to domains we've never MX-resolved — some of which are Google/MS/Yahoo we'd otherwise exclude. This is acceptable as a bootstrap but should shrink over time.

Gap 3 — mx_tag_carriers.py is not on a cron

There is no infra/cron/pw-mx-tag (confirmed: no cron references it). So the NULL backlog only shrinks when someone runs it by hand. New carriers imported by the FMCSA census downloader land as NULL and stay NULL. Without continuous tagging, Gaps 1 and 2 slowly re-open.


Proposed fixes

Fix 1 — exclude consumer/throttling mx: operators during warmup (HIGH)

Add an explicit set of mx:-prefixed operators that should be treated like the big operators during warmup, and fold them into BOTH the exclusion and the throttle. Keep it data-driven and documented.

# scripts/build_trucking_campaigns.py
# Consumer / aggressively-filtering mailbox operators that mx_tag_carriers.py
# labels with the "mx:" prefix (no clean label). They throttle/complaint-block
# like the big operators, so hold them out during warmup too. (yahoodns =
# Yahoo Small Business + AOL custom domains; icloud = Apple custom domains.)
CONSUMER_MX_OPERATORS = (
    "mx:yahoodns.net", "mx:icloud.com", "mx:comcast.net", "mx:charter.net",
    "mx:centurylink.net", "mx:windstream.net", "mx:tds.net",
    "mx:earthlink-vadesecure.net",
)
# Everything held out of the warmup pool entirely (until MAIN_BIG_MX_EXCLUDE_UNTIL_DAY).
WARMUP_EXCLUDE_OPERATORS = BIG_MX_OPERATORS + CONSUMER_MX_OPERATORS
  • In fetch_carriers(): build big_mx_exclude from WARMUP_EXCLUDE_OPERATORS (not just BIG_MX_OPERATORS).
  • In mx_daily_caps(): give CONSUMER_MX_OPERATORS the same big ramp as the clean big operators after day 30 (so they re-introduce gradually, not all at once on day 31).
  • Keep it behind the existing MAIN_SKIP_BIG_MX switch so it's reversible.

Effect: removes ~330k consumer-MX carriers from the warmup-window pool; the long tail of genuinely small/self-hosted systems carries the volume, which is the whole point of the warmup strategy.
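The "hold out, then ramp" behavior for an excluded operator can be illustrated as below. The real ramp inside mx_daily_caps() is not shown in this plan, so the linear shape, the step_per_day and max_cap values, and the function name warmup_operator_cap are all hypothetical; only the cap of 0 through day 30 comes from the plan.

```python
def warmup_operator_cap(
    warmup_day: int,
    exclude_until_day: int = 30,
    step_per_day: int = 10,  # hypothetical ramp slope
    max_cap: int = 200,      # hypothetical ceiling
) -> int:
    """Daily cap for one warmup-excluded operator (illustrative linear ramp)."""
    # Held out entirely during the exclusion window.
    if warmup_day <= exclude_until_day:
        return 0
    # Afterwards, grow by a fixed step per day up to the ceiling.
    return min(max_cap, step_per_day * (warmup_day - exclude_until_day))


for day in (17, 30, 31, 45, 60):
    print(day, warmup_operator_cap(day))
# 17 0
# 30 0
# 31 10
# 45 150
# 60 200
```

Applying one such function to every member of WARMUP_EXCLUDE_OPERATORS is what gives the consumer operators the same re-introduction curve as the clean big operators.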

Fix 2 — bound the NULL bucket with a small cap (MEDIUM)

Don't exclude NULL (still anti-starvation), but give it a real per-run cap in select_sendable_carriers() instead of "uncapped". E.g. treat unknown/NULL like __default__ but at a fraction (say 40/run) so an untagged Google/Yahoo domain can't flood a run. Pairs with Fix 3 (continuous tagging) to shrink the bucket.
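A minimal sketch of a bounded NULL bucket follows, assuming candidates arrive in priority order as (carrier_id, mx_provider) pairs. The function name select_with_untagged_cap and its signature are illustrative and are not the real select_sendable_carriers() interface.

```python
from collections import Counter
from typing import Iterable, Optional


def select_with_untagged_cap(
    candidates: Iterable[tuple[int, Optional[str]]],
    quota: int,
    per_op_cap: int,
    untagged_cap: int,
) -> list[int]:
    """Pick up to `quota` carriers, capping each operator and the NULL bucket."""
    picked: list[int] = []
    used: Counter = Counter()
    for carrier_id, mx_provider in candidates:
        if len(picked) >= quota:
            break
        # All untagged (None) carriers share one bucket with its own cap.
        cap = untagged_cap if mx_provider is None else per_op_cap
        if used[mx_provider] >= cap:
            continue
        used[mx_provider] += 1
        picked.append(carrier_id)
    return picked


candidates = [
    (101, None), (102, None), (103, None),
    (104, "mx:example.net"), (105, "mx:example.org"),
]
print(select_with_untagged_cap(candidates, quota=4, per_op_cap=2, untagged_cap=2))
# [101, 102, 104, 105]
print(select_with_untagged_cap(candidates, quota=4, per_op_cap=2, untagged_cap=max(4, 200)))
# [101, 102, 103, 104]
```

The second call uses a cap of max(quota, 200), which never binds at this quota and therefore leaves selection unchanged; the first shows how a tighter cap pushes the remaining slots toward tagged long-tail operators.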

Fix 3 — put mx_tag_carriers.py on a daily cron (MEDIUM)

Add infra/cron/pw-mx-tag (model on pw-listmonk-scrub) running e.g. 05:45 UTC (before the 08:00 trucking builder), tagging the next N thousand NULL domains/day:

# cron does not support backslash line continuations; keep the entry on one line
45 5 * * * deploy cd /opt/performancewest && docker compose exec -T workers python3 -m scripts.mx_tag_carriers --limit-domains 20000 >> /var/log/pw-mx-tag.log 2>&1

Install to /etc/cron.d/ (deploy.sh doesn't run ansible). This continuously shrinks the 315k NULL backlog and keeps newly-imported carriers tagged, so Fixes 1 & 2 stay effective.


Validation plan (verify before/after, no sends triggered)

  1. Dry-run the selector before/after Fix 1 and diff the per-MX composition of a simulated run (the builder has list_segments() / quota selection paths that can be exercised read-only). Assert 0 carriers from CONSUMER_MX_OPERATORS are selected while main_warmup_day() <= 30.
  2. SQL sanity: SELECT mx_provider, count(*) ... WHERE listmonk_sent_at IS NULL GROUP BY 1 — confirm the excluded operators drop out of the candidate pool.
  3. Cron (Fix 3): run mx_tag_carriers --limit-domains 1000 once by hand, confirm the NULL count falls and no errors; then install the cron and confirm the next-day count fell again (idempotent, bounded).
  4. Regression: confirm the long-tail pool is still large enough to hit daily quota at warmup caps (so we don't starve the send). If the long tail is too small after excluding 330k consumer-MX, that's a signal to either lower the daily quota or accept a smaller controlled slice of one consumer operator.
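Steps 1 and 2 can be approximated with a small read-only check over the output of a dry run. The sketch assumes a selection is available as (carrier_id, mx_provider) pairs; the helper names mx_composition, assert_no_consumer_mx, and composition_diff are hypothetical, and the carrier rows in the example are made up for illustration.

```python
from collections import Counter
from typing import Optional

CONSUMER_MX_OPERATORS = (
    "mx:yahoodns.net", "mx:icloud.com", "mx:comcast.net", "mx:charter.net",
    "mx:centurylink.net", "mx:windstream.net", "mx:tds.net",
    "mx:earthlink-vadesecure.net",
)

Selection = list[tuple[int, Optional[str]]]


def mx_composition(selected: Selection) -> Counter:
    """Count selected carriers per mx_provider (None = untagged)."""
    return Counter(mx for _, mx in selected)


def assert_no_consumer_mx(selected: Selection, warmup_day: int, until_day: int = 30) -> None:
    """Fail if any consumer-MX carrier is selected inside the warmup window."""
    if warmup_day > until_day:
        return
    leaked = [cid for cid, mx in selected if mx in CONSUMER_MX_OPERATORS]
    assert not leaked, f"consumer-MX carriers selected during warmup: {leaked}"


def composition_diff(before: Selection, after: Selection) -> dict:
    """Per-operator change in selected count between two dry runs."""
    b, a = mx_composition(before), mx_composition(after)
    ops = sorted(set(b) | set(a), key=str)  # key=str so None sorts with strings
    return {op: a[op] - b[op] for op in ops if a[op] != b[op]}


before = [(1, "mx:yahoodns.net"), (2, "mx:example.net"), (3, None)]
after = [(2, "mx:example.net"), (3, None), (4, "mx:other.org")]
assert_no_consumer_mx(after, warmup_day=17)
print(composition_diff(before, after))
# {'mx:other.org': 1, 'mx:yahoodns.net': -1}
```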

Open questions / decisions for owner

  • Re-introduction after day 30: treat CONSUMER_MX_OPERATORS identically to the big operators (same ramp), or keep Yahoo/iCloud custom domains excluded longer (they convert worse and complain more)? Recommendation: same ramp, but watch the reputation monitor's per-operator reject% and pull back if Yahoo spikes.
  • NULL cap size (Fix 2): 40/run is a guess; tune against how fast Fix 3 drains the backlog.
  • Should mx: consumer exclusion be permanent (not just warmup)? For a B2B compliance product, a carrier reachable only at a Yahoo/iCloud custom domain is a low-value, high-complaint segment regardless of warmup. Worth considering a permanent down-weight, not just a warmup hold.