From 98bcf0bbb04c8eba15344f47177ce7ea47d79b2b Mon Sep 17 00:00:00 2001
From: justin
Date: Tue, 2 Jun 2026 12:25:33 -0500
Subject: [PATCH] docs: email deliverability + IP warmup runbook

Document the self-hosted MTA layout, the May 30-31 reputation collapse,
the Jun 02 remediation (retired burned IPs .91/.92/.93, swapped rotation
to fresh .94/.95/.96, full Yahoo-family hold map, Listmonk sliding-window
cap, paused the 13k-recipient blast scheduled for Jun 03), and the
fresh-IP warmup rules + monitoring commands.
---
 docs/email-deliverability-runbook.md | 106 +++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 docs/email-deliverability-runbook.md

diff --git a/docs/email-deliverability-runbook.md b/docs/email-deliverability-runbook.md
new file mode 100644
index 0000000..4797dd2
--- /dev/null
+++ b/docs/email-deliverability-runbook.md
@@ -0,0 +1,106 @@
+# Email Deliverability & IP Warmup Runbook
+
+Performance West self-hosts its outbound MTA (Postfix on the app server) because
+transactional relays (SES, Postmark, SendGrid) forbid the cold prospecting email
+our FMCSA trucking and telecom campaigns depend on. That means **we own our
+sending-IP reputation** and must manage it manually. This doc is the operational
+guide for keeping it healthy.
+
+## Infrastructure layout
+
+- **Host Postfix** on the app server (`207.174.124.71`), reached by Listmonk via
+  SMTP at `172.18.0.1:25`.
+- **Sending IPs:** `207.174.124.90` through `.109` (20 IPs), each with valid
+  FCrDNS (`mtaNN.performancewest.net`) and authorized in SPF (`-all`).
+  - `.90` / `mta01`: historically a dedicated Yahoo trickle IP. We no longer mail
+    Yahoo at all, so it is idle.
+  - `.91-.109` / `mta02-mta20`: rotation pool, selected via
+    `transport_maps = hash:/etc/postfix/transport, randmap:{}`.
+- **Warmup scheduler:** `/usr/local/bin/pw-mta-warmup` (daily cron
+  `/etc/cron.d/pw-mta-warmup`, 07:17 UTC). Recomputes the active rotation pool
+  from a start date stamped in `/etc/postfix/pw-warmup-start`. Ramp schedule:
+  day 0-3 -> 3 IPs, 4-7 -> 5, 8-11 -> 8, 12-17 -> 12, 18-24 -> 16, 25+ -> 19.
+  The pool only ever grows. It picks IPs from the front of the `ALL=(...)` array.
+
+## What we do NOT mail
+
+The **Yahoo / Verizon-Media family** is excluded entirely (yahoo, aol, att,
+verizon, frontier, sbcglobal, bellsouth, pacbell, ameritech, ymail, rocketmail,
+aim, netscape, compuserve, etc.). They aggressively defer cold senders with
+`421 4.7.0 [TSS04] ... unexpected volume or user complaints`, and that deferral
+poisons the sending IP for Gmail and Microsoft too.
+
+Enforced in two layers:
+1. **Audience build** (authoritative): `scripts/_email_exclusions.py`
+   (`BLOCKED_EMAIL_DOMAINS`), imported by `build_trucking_campaigns.py` and
+   `populate_new_carrier_startup_campaign.py`. New campaigns never include them.
+2. **Postfix backstop:** `/etc/postfix/transport` maps every Yahoo-family domain
+   to `hold:`. If any leak into the queue they are parked, never sent from a
+   rotation IP.
+
+## Incident: May 30-31 2026 reputation collapse
+
+A campaign blast pushed ~29k sends in a day across cold IPs `.91/.92/.93` with no
+daily volume cap. Result:
+- Gmail: `550-5.7.1 ... likely unsolicited mail` (hard spam block).
+- Yahoo: `421 TSS04` on the rotation IPs.
+- Steady state afterward: ~13% delivery (10k sent vs 68k deferred + 7k bounced
+  in a day). Listmonk open rate ~4%, clicks ~0.
+
+### Remediation (Jun 02 2026)
+- **Retired the 3 burned IPs** (`.91/.92/.93` = out02/03/04) from rotation.
+  Confirmed `.94-.109` had never sent outbound (only inbound port-scan noise),
+  so they are pristine.
+- **Swapped rotation to fresh `.94/.95/.96`** (out05/06/07) and reset the warmup
+  start date to day 0.
+- **Patched `pw-mta-warmup`** `ALL` array to start at `out05` so the daily cron
+  never reverts to the burned IPs.
+- **Rewrote `/etc/postfix/transport`** to `hold:` the full Yahoo family (was a
+  partial list with buggy duplicate keys routing to `yahooslow`).
+- **Flushed the entire stale queue** (1,846 blast-era messages, mostly dead
+  satellite ISPs) so fresh IPs start clean.
+- **Enabled Listmonk sliding-window rate limit** so no campaign can blast again:
+  `app.message_sliding_window=true`, duration `1h`, rate `50`, `message_rate=2`.
+- **Paused 19 trucking campaigns** (IDs 275-293, ~13k recipients) that were
+  scheduled to fire Jun 03; they were built before the exclusion fix and would
+  have re-torched the fresh IPs. Rebuild them small/clean before resending.
+
+## Fresh-IP warmup discipline (the rules)
+
+1. **Small audiences.** Day 0-3: a few hundred TOTAL per day, not per campaign.
+   Lower the `limit` values in `build_trucking_campaigns.py` segment specs while
+   warming.
+2. **Best recipients first.** Only verified / engaged addresses. Gmail and
+   Microsoft only (Yahoo family already excluded).
+3. **Scrub hard bounces immediately.** `550 5.1.1` (no such user), full mailbox,
+   "not our customer" all hurt reputation signals.
+4. **Watch the signals daily** (see commands below). If Gmail `550-5.7.1` or
+   Yahoo `421 TSS04` reappear, STOP and hold for several days.
+5. **Ramp Listmonk's sliding window in step with the IP warmup** (e.g. 50/h ->
+   150/h -> 300/h as days pass and signals stay clean). Restart the listmonk
+   container after changing `settings`.
+
+## Monitoring commands
+
+```bash
+# delivery mix today (%e pads the day with a space, as traditional syslog does)
+sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -oE 'status=(sent|deferred|bounced)' | sort | uniq -c
+
+# per-IP outbound volume today (catch a runaway blast early)
+for ip in 94 95 96; do echo -n ".$ip: "; sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -c "207.174.124.$ip"; done
+
+# top deferral / bounce reasons today
+sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -E 'status=(deferred|bounced)' | grep -oE 'said: [0-9]{3}[^)]{0,50}' | sort | uniq -c | sort -rn | head
+
+# queue size
+sudo postqueue -p | tail -1
+
+# active rotation pool + warmup day
+sudo postconf -h transport_maps
+echo $(( ($(date +%s) - $(sudo cat /etc/postfix/pw-warmup-start)) / 86400 ))
+```
+
+## Backups left on the server (Jun 02 2026 remediation)
+- `/etc/postfix/main.cf.bak.*`
+- `/etc/postfix/transport.bak.*`
+- `/usr/local/bin/pw-mta-warmup.bak.*`
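
The warmup-day arithmetic in the monitoring commands and the ramp schedule listed under the warmup scheduler can be combined into a quick sanity check of how many rotation IPs should be active. The sketch below is not the contents of `pw-mta-warmup`; `pool_size_for_day` and `warmup_day` are hypothetical helper names, and the example timestamps are made up so the snippet runs without `sudo`.

```bash
# Hypothetical helper (not taken from pw-mta-warmup): map a warmup day to the
# expected number of active rotation IPs, per the runbook's ramp schedule.
pool_size_for_day() {
  day=$1
  if   [ "$day" -le 3 ];  then echo 3
  elif [ "$day" -le 7 ];  then echo 5
  elif [ "$day" -le 11 ]; then echo 8
  elif [ "$day" -le 17 ]; then echo 12
  elif [ "$day" -le 24 ]; then echo 16
  else                         echo 19
  fi
}

# Hypothetical helper: whole days elapsed between two epoch timestamps,
# the same arithmetic as the runbook's warmup-day command.
# Usage: warmup_day START_EPOCH NOW_EPOCH
warmup_day() {
  echo $(( ($2 - $1) / 86400 ))
}

# Example with made-up timestamps nine days apart.
start=1780000000
now=$(( start + 9 * 86400 ))
d=$(warmup_day "$start" "$now")
echo "warmup day $d: expect $(pool_size_for_day "$d") rotation IPs"
```

On the server, the inputs would presumably be `$(sudo cat /etc/postfix/pw-warmup-start)` and `$(date +%s)`, as in the warmup-day command. With `.91/.92/.93` retired, only `.94-.109` (16 IPs) remain, so the 25+ step may be capped below 19 in practice; the runbook does not say how the patched script handles that.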