new-site/docs/email-deliverability-runbook.md
justin 4d5901921e mail: fix OpenDKIM not signing campaign mail (Docker-injected) + codify in Ansible
Root cause of the Jun 2026 deliverability collapse / 'no new sales':
opendkim.conf was in single-key mode with no InternalHosts, so it signed only
127.0.0.1. Transactional/cron mail (injected locally) was signed, but ALL
campaign mail -- injected over the Docker bridge from the Listmonk containers
(172.18.0.5 trucking, 172.18.0.25 healthcare) -- went out UNSIGNED. Gmail/Yahoo
require DKIM on bulk mail since Feb 2024, so cold campaigns were junked/blocked
(~23% delivery, 550-5.7.1). Proof: 2,620 campaign msgs that day, 0 DKIM sigs.

The correct table files already existed on the server but were never wired into
opendkim.conf. Fix points the daemon at key.table/signing.table and sets
InternalHosts/ExternalIgnoreList to trusted.hosts (which includes 172.16.0.0/12,
the Docker subnet). Fixes BOTH streams: HC submission ports 2526-2528 inherit
the global smtpd_milters and *@performancewest.net covers compliance@.

Verified by injecting from a Docker IP through port 25 and port 2526 -- both now
get 'DKIM-Signature field added'. Codified as new Ansible role 'mail' so it
can't silently regress (OpenDKIM was previously not in IaC at all).
2026-06-17 19:31:19 -05:00

Email Deliverability & IP Warmup Runbook

Performance West self-hosts its outbound MTA (Postfix on the app server) because transactional relays (SES, Postmark, SendGrid) forbid the cold prospecting email our FMCSA trucking and telecom campaigns depend on. That means we own our sending-IP reputation and must manage it manually. This doc is the operational guide for keeping it healthy.

Infrastructure layout

  • Host Postfix on the app server (207.174.124.71), reached by Listmonk via SMTP at 172.18.0.1:25.
  • Sending IPs: 207.174.124.90 through .109 (20 IPs), each with valid FCrDNS (mtaNN.performancewest.net) and authorized in SPF (-all).
    • .90 / mta01: historically a dedicated Yahoo trickle IP. We no longer mail Yahoo at all, so it is idle.
    • .91-.109 / mta02-mta20: rotation pool, selected via transport_maps = hash:/etc/postfix/transport, randmap:{<active pool>}.
  • Warmup scheduler: /usr/local/bin/pw-mta-warmup (daily cron /etc/cron.d/pw-mta-warmup, 07:17 UTC). Recomputes the active rotation pool from a start date stamped in /etc/postfix/pw-warmup-start. Ramp schedule: day 0-3 -> 3 IPs, 4-7 -> 5, 8-11 -> 8, 12-17 -> 12, 18-24 -> 16, 25+ -> 19. The pool only ever grows. It picks IPs from the front of the ALL=(...) array.
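The ramp schedule above can be sketched as a small lookup. This is only an illustration of the documented day-to-pool-size mapping, not the actual source of pw-mta-warmup; the function name pool_size_for_day is hypothetical.

```shell
# Map a warmup day (whole days since the start stamp) to the number of
# active rotation IPs, following the ramp schedule documented above.
# NOTE: hypothetical helper, not taken from pw-mta-warmup itself.
pool_size_for_day() (
  day=$1
  if   [ "$day" -le 3 ];  then echo 3
  elif [ "$day" -le 7 ];  then echo 5
  elif [ "$day" -le 11 ]; then echo 8
  elif [ "$day" -le 17 ]; then echo 12
  elif [ "$day" -le 24 ]; then echo 16
  else                         echo 19
  fi
)
```

For example, pool_size_for_day 9 prints 8. Because the thresholds only increase with the day number, the pool size can never shrink as the warmup progresses.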

What we do NOT mail

The Yahoo / Verizon-Media family is excluded entirely (yahoo, aol, att, verizon, frontier, sbcglobal, bellsouth, pacbell, ameritech, ymail, rocketmail, aim, netscape, compuserve, etc.). They aggressively defer cold senders with 421 4.7.0 [TSS04] ... unexpected volume or user complaints, and that deferral poisons the sending IP for Gmail and Microsoft too.

Enforced in two layers:

  1. Audience build (authoritative): scripts/_email_exclusions.py (BLOCKED_EMAIL_DOMAINS), imported by build_trucking_campaigns.py and populate_new_carrier_startup_campaign.py. New campaigns never include them.
  2. Postfix backstop: /etc/postfix/transport maps every Yahoo-family domain to hold:. If any leak into the queue they are parked, never sent from a rotation IP.
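A quick way to audit the Postfix backstop is to check that each excluded domain really has a hold: entry in the transport source file. The sketch below assumes the file uses plain "domain transport" lines with exact domain keys; the helper name missing_holds is hypothetical, and the domain list passed to it is whatever you choose to audit.

```shell
# Print every given domain that does NOT have a "hold:" entry in a
# Postfix transport source file (plain "domain  transport" lines).
# NOTE: hypothetical helper; assumes exact-match domain keys only.
missing_holds() (
  transport_file=$1
  shift
  for domain in "$@"; do
    awk -v d="$domain" \
      '$1 == d && $2 ~ /^hold:/ { found = 1 } END { exit !found }' \
      "$transport_file" || echo "$domain"
  done
)
```

Usage: missing_holds /etc/postfix/transport yahoo.com aol.com att.net prints nothing when all three are parked. This only reads the text file; postmap -q yahoo.com hash:/etc/postfix/transport queries the compiled map that Postfix actually consults.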

Incident: May 30-31 2026 reputation collapse

A campaign blast pushed ~29k sends in a day across cold IPs .91/.92/.93 with no daily volume cap. Result:

  • Gmail: 550-5.7.1 ... likely unsolicited mail (hard spam block).
  • Yahoo: 421 TSS04 on the rotation IPs.
  • Steady state afterward: ~12% delivery (10k sent vs 68k deferred + 7k bounced in a day). Listmonk open rate ~4%, clicks ~0.

Remediation (Jun 02 2026)

  • Retired the 3 burned IPs (.91/.92/.93 = out02/03/04) from rotation. Confirmed .94-.109 had never sent outbound (only inbound port-scan noise), so they are pristine.
  • Swapped rotation to fresh .94/.95/.96 (out05/06/07) and reset the warmup start date to day 0.
  • Patched pw-mta-warmup ALL array to start at out05 so the daily cron never reverts to the burned IPs.
  • Rewrote /etc/postfix/transport to hold: the full Yahoo family (was a partial list with buggy duplicate keys routing to yahooslow).
  • Flushed the entire stale queue (1,846 blast-era messages, mostly dead satellite ISPs) so fresh IPs start clean.
  • Enabled Listmonk sliding-window rate limit so no campaign can blast again: app.message_sliding_window=true, duration 1h, rate 50, message_rate=2.
  • Paused 19 trucking campaigns (IDs 275-293, ~13k recipients) that were scheduled to fire Jun 03; they were built before the exclusion fix and would have re-torched the fresh IPs. Rebuild them small/clean before resending.

Fresh-IP warmup discipline (the rules)

The historical mail.log shows these IPs sustain ~2,500 sends/day at 68-76% delivery once warm (May 19-21). Every collapse followed a 17k-29k spike. So we ramp ASSERTIVELY but never spike. The Listmonk sliding-window cap (/usr/local/bin/pw-listmonk-rampcap, daily cron 07:20 UTC, driven off the same /etc/postfix/pw-warmup-start stamp) enforces this automatically:

| warmup day | hourly cap | ~daily total           |
|------------|------------|------------------------|
| 0-1        | 50/h       | ~500                   |
| 2-3        | 150/h      | ~1,500                 |
| 4-6        | 250/h      | ~2,500                 |
| 7+         | 300/h      | ~3,000 (hard ceiling)  |

Hard rule from the data: never exceed ~4k/day, never spike.
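The cap table can be expressed as a lookup plus one multiplication. The sketch below is not the pw-listmonk-rampcap source; the function names are hypothetical, and the figure of roughly 10 sending hours per day is an assumption inferred from the table (50/h corresponds to ~500/day).

```shell
# Hourly cap for a given warmup day, per the table above.
# NOTE: hypothetical helper, not taken from pw-listmonk-rampcap.
hourly_cap_for_day() (
  day=$1
  if   [ "$day" -le 1 ]; then echo 50
  elif [ "$day" -le 3 ]; then echo 150
  elif [ "$day" -le 6 ]; then echo 250
  else                        echo 300
  fi
)

# ASSUMPTION: ~10 sending hours/day, inferred from 50/h ~ 500/day.
SENDING_HOURS=${SENDING_HOURS:-10}

# Approximate daily total = hourly cap x sending hours.
daily_total_for_day() (
  echo $(( $(hourly_cap_for_day "$1") * SENDING_HOURS ))
)
```

With the default of 10 sending hours, daily_total_for_day 5 prints 2500 and any day from 7 onward prints 3000, which stays under the ~4k/day hard rule.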

Other rules:

  1. Best recipients first. Gmail + Microsoft + clean ISPs only (Yahoo family already excluded). Send small focused batches, e.g. build_trucking_campaigns --only-segment mcs150 --max-per-segment 100 --date <today> --send-hour <H>.
  2. Scrub hard bounces immediately. 550 5.1.1, full mailbox, "not our customer" all hurt reputation signals.
  3. Watch the signals daily (see commands below). If Gmail 550-5.7.1 or Yahoo 421 TSS04 reappear, STOP and hold for several days.

Monitoring commands

# delivery mix today (%e space-pads single-digit days, matching traditional syslog timestamps such as "Jun  2")
sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -oE 'status=(sent|deferred|bounced)' | sort | uniq -c

# per-IP outbound volume today (catch a runaway blast early); -F matches the IP literally
for ip in 94 95 96; do echo -n ".$ip: "; sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -cF "207.174.124.$ip"; done

# top deferral / bounce reasons today
sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -E 'status=(deferred|bounced)' | grep -oE 'said: [0-9]{3}[^)]{0,50}' | sort | uniq -c | sort -rn | head

# queue size
sudo postqueue -p | tail -1

# active rotation pool + warmup day
sudo postconf -h transport_maps
echo $(( ($(date +%s) - $(sudo cat /etc/postfix/pw-warmup-start)) / 86400 ))
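The delivery-mix counts are easier to compare day to day as a single percentage. A minimal sketch, assuming delivery rate is defined as sent / (sent + deferred + bounced); the helper name delivery_pct is hypothetical.

```shell
# Integer delivery percentage from the three status counts.
# NOTE: hypothetical helper; defines delivery as sent / all attempts.
delivery_pct() (
  sent=$1; deferred=$2; bounced=$3
  total=$(( sent + deferred + bounced ))
  if [ "$total" -eq 0 ]; then
    echo 0            # no traffic yet today
  else
    echo $(( sent * 100 / total ))
  fi
)
```

Fed the Jun 17 incident counts recorded in this runbook, delivery_pct 1802 5143 580 prints 23, compared with the 68-76% seen on warm IPs.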

Backups left on the server (Jun 02 2026 remediation)

  • /etc/postfix/main.cf.bak.*
  • /etc/postfix/transport.bak.*
  • /usr/local/bin/pw-mta-warmup.bak.*

Incident: Jun 17 2026 — campaign mail sent UNSIGNED (no DKIM)

Symptom: "no new sales." Campaigns were sending (~3-4k/day) but delivery was ~23% (sent 1,802 vs deferred 5,143 + bounced 580), Gmail returned 550-5.7.1 likely unsolicited mail, and there were zero clicks since Jun 8 despite ~600 opens/day.

Root cause: OpenDKIM was signing nothing that came from Listmonk. /etc/opendkim.conf was in single-key mode with no InternalHosts, so it defaulted to signing only 127.0.0.1. Cron/transactional mail is injected locally (127.0.0.1) so it WAS signed — but campaign mail is injected over the Docker bridge from the Listmonk containers (172.18.0.5 trucking, 172.18.0.25 healthcare). Those clients were not "internal," so OpenDKIM verified (instead of signed) them: every cold email went out unsigned. Since Feb 2024 Gmail/Yahoo require DKIM on bulk mail, so unsigned campaigns were junked/blocked. Proof: 2,620 campaign messages that day, 0 "DKIM-Signature field added" events, while the every-5-min cron mail was signed.

The correct table files already existed (/etc/opendkim/{key.table,signing.table,trusted.hosts}, and trusted.hosts already listed 172.16.0.0/12); they were simply never wired into opendkim.conf.
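To see why the 172.16.0.0/12 entry covers the Listmonk containers: a /12 fixes the first octet at 172 and the top four bits of the second octet, so the range runs from 172.16.0.0 to 172.31.255.255. The check below is a sketch of that arithmetic only; the function name is hypothetical and it is unrelated to how OpenDKIM evaluates the file internally.

```shell
# Succeeds (exit 0) if a dotted-quad IPv4 address lies in 172.16.0.0/12,
# i.e. first octet 172 and second octet between 16 and 31 inclusive.
# NOTE: hypothetical helper; expects a well-formed numeric IPv4 address.
in_docker_private_range() (
  ip=$1
  IFS=.
  set -- $ip          # split the address into its four octets
  [ "$#" -eq 4 ] && [ "$1" -eq 172 ] && [ "$2" -ge 16 ] && [ "$2" -le 31 ]
)
```

Both container addresses named above pass (in_docker_private_range 172.18.0.5 and in_docker_private_range 172.18.0.25), while 127.0.0.1 does not, so the trusted.hosts file must list localhost separately.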

Fix (now codified in Ansible roles/mail): point opendkim.conf at the tables and set the signing scope —

KeyTable           refile:/etc/opendkim/key.table
SigningTable       refile:/etc/opendkim/signing.table
InternalHosts      /etc/opendkim/trusted.hosts   # includes 172.16.0.0/12 (Docker)
ExternalIgnoreList /etc/opendkim/trusted.hosts
OversignHeaders    From

Then restart the daemon with systemctl restart opendkim. This fixes BOTH streams at once: the healthcare submission instances (ports 2526-2528) inherit the global smtpd_milters, and the *@performancewest.net signing-table entry covers compliance@. Verified by injecting a message from a Docker IP through both port 25 and port 2526 and confirming "DKIM-Signature field added" for each.

Verify DKIM is actually signing campaign mail:

# Should be NON-ZERO and roughly track campaign volume:
sudo journalctl -u opendkim --since today | grep -c 'DKIM-Signature field added'
# Cross-check: campaign cleanup events today (should be similar order of magnitude)
sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -c postfix/cleanup
# Key still matches published DNS:
sudo opendkim-testkey -d performancewest.net -s mail -vvv   # expect "key OK"

Still TODO from this incident (list quality + content, not yet done):

  • Scrub dead rural/satellite ISPs + dead M365 tenants from audiences and suppress repeat-deferring/bouncing domains (extend _email_exclusions.py).
  • Throttle/pause Gmail until reputation recovers (550-5.7.1 was still firing).
  • Add a plaintext (altbody) MIME part — all campaigns are currently HTML-only, itself a spam signal.
  • Fix the self-bounce cron emailing the nonexistent deploy@performancewest.net (~700 self-inflicted 550 bounces/day).