# Email Deliverability & IP Warmup Runbook

Performance West self-hosts its outbound MTA (Postfix on the app server) because transactional relays (SES, Postmark, SendGrid) forbid the cold prospecting email our FMCSA trucking and telecom campaigns depend on. That means **we own our sending-IP reputation** and must manage it manually. This doc is the operational guide for keeping it healthy.

## Infrastructure layout

- **Host Postfix** on the app server (`207.174.124.71`), reached by Listmonk via SMTP at `172.18.0.1:25`.
- **Sending IPs:** `207.174.124.90` through `.109` (20 IPs), each with valid FCrDNS (`mtaNN.performancewest.net`) and authorized in SPF (`-all`).
  - `.90` / `mta01`: historically a dedicated Yahoo trickle IP. We no longer mail Yahoo at all, so it is idle.
  - `.91-.109` / `mta02-mta20`: rotation pool, selected via `transport_maps = hash:/etc/postfix/transport, randmap:{}`.
- **Warmup scheduler:** `/usr/local/bin/pw-mta-warmup` (daily cron `/etc/cron.d/pw-mta-warmup`, 07:17 UTC). Recomputes the active rotation pool from a start date stamped in `/etc/postfix/pw-warmup-start`. Ramp schedule: day 0-3 -> 3 IPs, 4-7 -> 5, 8-11 -> 8, 12-17 -> 12, 18-24 -> 16, 25+ -> 19. The pool only ever grows. It picks IPs from the front of the `ALL=(...)` array.

## What we do NOT mail

The **Yahoo / Verizon-Media family** is excluded entirely (yahoo, aol, att, verizon, frontier, sbcglobal, bellsouth, pacbell, ameritech, ymail, rocketmail, aim, netscape, compuserve, etc.). They aggressively defer cold senders with `421 4.7.0 [TSS04] ... unexpected volume or user complaints`, and that deferral poisons the sending IP for Gmail and Microsoft too.

Enforced in two layers:

1. **Audience build** (authoritative): `scripts/_email_exclusions.py` (`BLOCKED_EMAIL_DOMAINS`), imported by `build_trucking_campaigns.py` and `populate_new_carrier_startup_campaign.py`. New campaigns never include them.
2. **Postfix backstop:** `/etc/postfix/transport` maps every Yahoo-family domain to `hold:`.
   If any leak into the queue, they are parked and never sent from a rotation IP.

## Incident: May 30-31 2026 reputation collapse

A campaign blast pushed ~29k sends in a day across cold IPs `.91/.92/.93` with no daily volume cap. Result:

- Gmail: `550-5.7.1 ... likely unsolicited mail` (hard spam block).
- Yahoo: `421 TSS04` on the rotation IPs.
- Steady state afterward: ~13% delivery (10k sent vs 68k deferred + 7k bounced in a day). Listmonk open rate ~4%, clicks ~0.

### Remediation (Jun 02 2026)

- **Retired the 3 burned IPs** (`.91/.92/.93` = out02/03/04) from rotation. Confirmed `.94-.109` had never sent outbound (only inbound port-scan noise), so they are pristine.
- **Swapped rotation to fresh `.94/.95/.96`** (out05/06/07) and reset the warmup start date to day 0.
- **Patched `pw-mta-warmup`** `ALL` array to start at `out05` so the daily cron never reverts to the burned IPs.
- **Rewrote `/etc/postfix/transport`** to `hold:` the full Yahoo family (was a partial list with buggy duplicate keys routing to `yahooslow`).
- **Flushed the entire stale queue** (1,846 blast-era messages, mostly dead satellite ISPs) so fresh IPs start clean.
- **Enabled Listmonk sliding-window rate limit** so no campaign can blast again: `app.message_sliding_window=true`, duration `1h`, rate `50`, `message_rate=2`.
- **Paused 19 trucking campaigns** (IDs 275-293, ~13k recipients) that were scheduled to fire Jun 03; they were built before the exclusion fix and would have re-torched the fresh IPs. Rebuild them small/clean before resending.

## Fresh-IP warmup discipline (the rules)

The historical mail.log proves these IPs sustain ~2,500 sends/day at 68-76% delivery once warm (May 19-21). Collapses only ever came from 17k-29k spikes. So we ramp ASSERTIVELY but never spike.
The Listmonk sliding-window cap (`/usr/local/bin/pw-listmonk-rampcap`, daily cron 07:20 UTC, driven off the same `/etc/postfix/pw-warmup-start` stamp) enforces this automatically:

| warmup day | hourly cap | ~daily total |
|-----------:|-----------:|-------------:|
| 0-1 | 50/h | ~500 |
| 2-3 | 150/h | ~1,500 |
| 4-6 | 250/h | ~2,500 |
| 7+ | 300/h | ~3,000 (hard ceiling) |

Hard rule from the data: **never exceed ~4k/day, never spike.** Other rules:

1. **Best recipients first.** Gmail + Microsoft + clean ISPs only (Yahoo family already excluded). Send small focused batches, e.g. `build_trucking_campaigns --only-segment mcs150 --max-per-segment 100 --date --send-hour `.
2. **Scrub hard bounces immediately.** `550 5.1.1`, full mailbox, "not our customer" all hurt reputation signals.
3. **Watch the signals daily** (see commands below). If Gmail `550-5.7.1` or Yahoo `421 TSS04` reappear, STOP and hold for several days.

## Monitoring commands

```bash
# delivery mix today
sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep -oE 'status=(sent|deferred|bounced)' | sort | uniq -c

# per-IP outbound volume today (catch a runaway blast early)
for ip in 94 95 96; do echo -n ".$ip: "; sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep -c "207.174.124.$ip"; done

# top deferral / bounce reasons today
sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep status=deferred | grep -oE 'said: [0-9]{3}[^)]{0,50}' | sort | uniq -c | sort -rn | head

# queue size
sudo postqueue -p | tail -1

# active rotation pool + warmup day
sudo postconf -h transport_maps
echo $(( ($(date +%s) - $(sudo cat /etc/postfix/pw-warmup-start)) / 86400 ))
```

## Backups left on the server (Jun 02 2026 remediation)

- `/etc/postfix/main.cf.bak.*`
- `/etc/postfix/transport.bak.*`
- `/usr/local/bin/pw-mta-warmup.bak.*`

## Incident: Jun 17 2026 campaign mail sent UNSIGNED (no DKIM)

**Symptom:** "no new sales."
Campaigns were sending (~3-4k/day) but delivery was ~23% (sent 1,802 vs deferred 5,143 + bounced 580), Gmail returned `550-5.7.1 likely unsolicited mail`, and there were **zero clicks since Jun 8** despite ~600 opens/day.

**Root cause:** OpenDKIM was signing **nothing** that came from Listmonk. `/etc/opendkim.conf` was in single-key mode with **no `InternalHosts`**, so it defaulted to signing only `127.0.0.1`. Cron/transactional mail is injected locally (127.0.0.1) so it WAS signed, but campaign mail is injected over the Docker bridge from the Listmonk containers (`172.18.0.5` trucking, `172.18.0.25` healthcare). Those clients were not "internal," so OpenDKIM *verified* (instead of *signed*) them: every cold email went out **unsigned**. Since Feb 2024 Gmail/Yahoo require DKIM on bulk mail, so unsigned campaigns were junked/blocked.

Proof: `2,620` campaign messages that day, `0` "DKIM-Signature field added" events, while the every-5-min cron mail was signed. The correct table files already existed (`/etc/opendkim/{key.table,signing.table,trusted.hosts}`, and `trusted.hosts` already listed `172.16.0.0/12`); they were simply **never wired into `opendkim.conf`**.

**Fix (now codified in Ansible `roles/mail`):** point `opendkim.conf` at the tables and set the signing scope:

```
KeyTable            refile:/etc/opendkim/key.table
SigningTable        refile:/etc/opendkim/signing.table
InternalHosts       /etc/opendkim/trusted.hosts   # includes 172.16.0.0/12 (Docker)
ExternalIgnoreList  /etc/opendkim/trusted.hosts
OversignHeaders     From
```

then `systemctl restart opendkim`. This fixes BOTH streams at once: the healthcare submission instances (ports 2526-2528) inherit the global `smtpd_milters` and the `*@performancewest.net` signing table covers `compliance@`. Verified by injecting a message from a Docker IP through both port 25 and port 2526 and confirming "DKIM-Signature field added" for each.
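The root cause comes down to a CIDR membership test on the connecting client's address. The following is a minimal sketch of that decision in Python, not OpenDKIM's actual code: the function name `would_sign` is hypothetical, the "loopback only" default mirrors what this write-up describes, and the presence of a loopback entry in `trusted.hosts` is an assumption.

```python
import ipaddress

# Assumption: with no InternalHosts configured, only loopback counts as internal,
# which is the behavior described in this incident.
DEFAULT_INTERNAL = ["127.0.0.1/32"]

# 172.16.0.0/12 is the entry the runbook says trusted.hosts already listed;
# the loopback entry is assumed for illustration.
TRUSTED_HOSTS = ["127.0.0.1/32", "172.16.0.0/12"]


def would_sign(client_ip: str, internal_hosts: list[str]) -> bool:
    """Return True if client_ip is inside any internal network (mail gets signed)."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in internal_hosts)


# Client addresses taken from the incident description above.
clients = {
    "cron/transactional": "127.0.0.1",
    "listmonk trucking": "172.18.0.5",
    "listmonk healthcare": "172.18.0.25",
}

for name, ip in clients.items():
    before = would_sign(ip, DEFAULT_INTERNAL)
    after = would_sign(ip, TRUSTED_HOSTS)
    print(f"{name:20} {ip:12} signed_before={before} signed_after={after}")
```

Under these assumptions, only the loopback client is signed before the fix, and all three are signed once the Docker range counts as internal.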
**Verify DKIM is actually signing campaign mail:**

```bash
# Should be NON-ZERO and roughly track campaign volume:
sudo journalctl -u opendkim --since today | grep -c 'DKIM-Signature field added'

# Cross-check: campaign cleanup events today (should be similar order of magnitude)
sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -c postfix/cleanup

# Key still matches published DNS:
sudo opendkim-testkey -d performancewest.net -s mail -vvv   # expect "key OK"
```

**Still TODO from this incident (list quality + content, not yet done):**

- Throttle/pause Gmail until reputation recovers (`550-5.7.1` was still firing). The trucking ramp/cap (`pw-listmonk-rampcap`) currently holds at 200/h and the builder excludes the big-MX operators (Google/Microsoft/...) until warmup day 30; revisit once reputation recovers.
- Dead M365 tenant scrub: HC defers are mostly `451 4.4.4` against dead M365 tenants + `421` LuxSci throttle. Identify and suppress dead tenants.

### Follow-up hardening: DONE (Jun 17-18 2026)

All discovered during the post-incident technical audit; each fix is codified.

1. **OpenDKIM not signing**: fixed + codified in Ansible `roles/mail` (commit `4d59019`). Foundational fix above.
2. **`mail.log` unbounded (~1 GB, no logrotate)**: this host logs via Postfix's built-in `postlogd` (no rsyslog), so a rename+create would strand the open inode. Added a `copytruncate` logrotate rule (daily, 14-day, compressed) to `roles/mail` (commit `2e4388a`). Applied live, 1 GB archive compressed.
3. **Plaintext (altbody) MIME part**: all campaigns were HTML-only (a spam-score signal; Listmonk only emits multipart/alternative when altbody is set). New `scripts/_email_plaintext.py` renders a text/plain part from the HTML body (preserves Listmonk template tags, links -> "text (url)"); wired into the trucking builder (and thus UCR + IFTA) and the healthcare builder. Tests: `scripts/test_email_plaintext.py`. Commits `a32a3b0`, `4664601`.
4. **`@localhost.localdomain` Message-IDs**: Listmonk derived the Message-ID from the random Docker container id. Pinned both listmonk + listmonk-hc `hostname: perfwest.performancewest.net` in `docker-compose.yml` (matches the SMTP `hello_hostname`). Commit `a32a3b0`.
5. **Dead/legacy/satellite ISP scrub**: added `DEAD_ISP_DOMAINS` (52 domains, identified from our own Listmonk bounce table) to `BLOCKED_EMAIL_DOMAINS` in `_email_exclusions.py`, so every builder that imports it stops cold-mailing them. Deliberately keeps still-active large consumer ISPs (comcast/charter/cox/centurylink) because their bounces were the no-DKIM problem, not dead mailboxes. Commit `c183957`.
6. **`deploy@performancewest.net` self-bounce**: the deploy user's crontab held 3 jobs (payment_reminder, amb_location_scraper, renewal_worker) that are EXACT duplicates of systemd timers in the `worker-crons` role AND redirected to `/var/log` (which deploy cannot write), so they failed and cron mailed the error to `deploy@` (no mailbox -> self-bounce). Removed the redundant deploy crontab (backed up to `logs/deploy-crontab.bak.*`); the systemd timers carry the work. No IaC change needed (Ansible never created that crontab).
7. **Entire campaign pipeline was not in IaC**: the campaign cron builders, IP warmup/ramp helpers, and bounce watchers lived ONLY on the host. New Ansible `mail-pipeline` role + `playbooks/deploy-mail-pipeline.yml` deploy them all from the canonical repo copies (`infra/cron/`, `infra/postfix/`, `infra/monitoring/`, `infra/systemd/`, `scripts/*bounce*`). Commit `4dc5690`.
8. **Telecom + transactional email was also HTML-only**: the campaign-builder plaintext fix (#3) only covered Listmonk mass-mail.
   The telecom/filing/customer-transactional path (499-Q reminders, RMD/FCC filing review links, intake/completion/delivery/commission emails, order confirmations) builds its own `MIMEMultipart` / nodemailer messages, and ~17 of them attached ONLY an HTML part, which is a malformed single-part `multipart/alternative` and a spam signal. Fixed at the source so all callers are covered:

   - `scripts/workers/worker_email.py` `send_worker_email()` now auto-derives the text/plain part from HTML via `_email_plaintext.html_to_text` when the caller omits `text=`.
   - 16 rolled-their-own Python senders (`scripts/workers/**`, `scripts/formation/document_delivery.py`) attach an `html_to_text(...)` plaintext sibling before the HTML part (`job_server` + `document_delivery` wrap text+html in an `alternative` sub-part so PDF/DOCX still attach to the `mixed` root).
   - `api/src/email.ts` gained a dependency-free `htmlToText()` and `sendEmail` now defaults `text` to it (covers checkout/webhook HTML-only sends).

   NB: telecom campaigns themselves are still **manually** created+sent in the Listmonk UI (no send automation; `compliance_alert_list.py` / `rmd_deficiency_campaign.py` only populate lists). The one telecom send to date (campaign 407 "FCC Deficiency Report - FREEDOM249", Jun 08) was HTML-only AND sent inside the DKIM-broken window: 384 sent / 343 views / **0 clicks** (the same junked-mail signature as the trucking blasts). Any future telecom UI campaign should set an altbody (Listmonk "Plain text" toggle) and run through the same dead-ISP/suppression hygiene. Commit `b375385`.
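
For reference, the shape of the plaintext fix in item 8 can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the code in `_email_plaintext.py`: `html_to_text_sketch` and `build_message` are hypothetical names, the conversion is a crude regex stand-in for the real `html_to_text`, and the recipient address and URL are placeholders.

```python
import re
from email.message import EmailMessage
from html import unescape


def html_to_text_sketch(html: str) -> str:
    """Rough HTML -> text: links become "text (url)", all other tags are dropped."""
    text = re.sub(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>', r"\2 (\1)",
                  html, flags=re.I | re.S)
    text = re.sub(r"<br\s*/?>|</p>", "\n", text, flags=re.I)  # keep line breaks
    text = re.sub(r"<[^>]+>", "", text)                        # strip remaining tags
    return unescape(text).strip()


def build_message(subject: str, sender: str, to: str, html: str) -> EmailMessage:
    """Build a multipart/alternative message with text/plain first, then text/html."""
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, sender, to
    msg.set_content(html_to_text_sketch(html))  # text/plain part
    msg.add_alternative(html, subtype="html")   # turns msg into multipart/alternative
    return msg


msg = build_message(
    "Filing reminder",
    "compliance@performancewest.net",
    "user@example.com",  # placeholder recipient
    '<p>Your filing is due. <a href="https://example.com/review">Review it</a></p>',
)
print(msg.get_content_type())
print([part.get_content_type() for part in msg.iter_parts()])
```

Ordering matters here: in `multipart/alternative` the last part is the preferred rendering, so the plaintext sibling goes before the HTML part, as the list above describes.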