docs: email deliverability + IP warmup runbook
Document the self-hosted MTA layout, the May 30-31 reputation collapse, the Jun 02 remediation (retired burned IPs .91/.92/.93, swapped rotation to fresh .94/.95/.96, full Yahoo-family hold map, Listmonk sliding-window cap, paused the 13k-recipient blast scheduled for Jun 03), and the fresh-IP warmup rules + monitoring commands.
Commit 98bcf0bbb0 (parent 344300ebd4)
1 changed file with 106 additions and 0 deletions: docs/email-deliverability-runbook.md (new file)

# Email Deliverability & IP Warmup Runbook

Performance West self-hosts its outbound MTA (Postfix on the app server) because
transactional relays (SES, Postmark, SendGrid) forbid the cold prospecting email
our FMCSA trucking and telecom campaigns depend on. That means **we own our
sending-IP reputation** and must manage it manually. This doc is the operational
guide for keeping it healthy.

## Infrastructure layout

- **Host Postfix** on the app server (`207.174.124.71`), reached by Listmonk via
  SMTP at `172.18.0.1:25`.
- **Sending IPs:** `207.174.124.90` through `.109` (20 IPs), each with valid
  FCrDNS (`mtaNN.performancewest.net`) and authorized in SPF (`-all`).
  - `.90` / `mta01`: historically a dedicated Yahoo trickle IP. We no longer mail
    Yahoo at all, so it is idle.
  - `.91-.109` / `mta02-mta20`: rotation pool, selected via
    `transport_maps = hash:/etc/postfix/transport, randmap:{<active pool>}`.
- **Warmup scheduler:** `/usr/local/bin/pw-mta-warmup` (daily cron
  `/etc/cron.d/pw-mta-warmup`, 07:17 UTC). Recomputes the active rotation pool
  from a start date stamped in `/etc/postfix/pw-warmup-start`. Ramp schedule:
  day 0-3 -> 3 IPs, 4-7 -> 5, 8-11 -> 8, 12-17 -> 12, 18-24 -> 16, 25+ -> 19.
  The pool only ever grows. It picks IPs from the front of the `ALL=(...)` array.

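The ramp schedule above can be read as a simple lookup from warmup day to pool size. The following is a minimal Python sketch of that logic, not the contents of `pw-mta-warmup` (which this runbook does not reproduce). The names `RAMP`, `warmup_day`, `pool_size`, and `active_pool` are hypothetical, and the start stamp is assumed to be epoch seconds.

```python
# Illustrative sketch of the ramp schedule; not the real pw-mta-warmup script.
# Each entry is (first warmup day of the step, pool size from that day on).
RAMP = [(0, 3), (4, 5), (8, 8), (12, 12), (18, 16), (25, 19)]


def warmup_day(start_epoch: int, now_epoch: int) -> int:
    """Whole days elapsed since the warmup start stamp (assumed epoch seconds)."""
    return max(0, (now_epoch - start_epoch) // 86400)


def pool_size(day: int) -> int:
    """Pool size for a warmup day: the last ramp step whose first day is <= day."""
    size = RAMP[0][1]
    for first_day, ips in RAMP:
        if day >= first_day:
            size = ips
    return size


def active_pool(day: int, all_ips: list[str]) -> list[str]:
    """Take IPs from the front of the list, capped by how many are available."""
    return all_ips[: pool_size(day)]


if __name__ == "__main__":
    # Hypothetical pool for illustration: .94 through .109.
    pool = [f"207.174.124.{n}" for n in range(94, 110)]
    for day in (0, 4, 11, 30):
        print(day, len(active_pool(day, pool)))
```

Because the slice is capped by the length of the list, a schedule step larger than the number of available IPs simply yields the whole list.
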
## What we do NOT mail

The **Yahoo / Verizon-Media family** is excluded entirely (yahoo, aol, att,
verizon, frontier, sbcglobal, bellsouth, pacbell, ameritech, ymail, rocketmail,
aim, netscape, compuserve, etc.). They aggressively defer cold senders with
`421 4.7.0 [TSS04] ... unexpected volume or user complaints`, and that deferral
poisons the sending IP for Gmail and Microsoft too.

Enforced in two layers:

1. **Audience build** (authoritative): `scripts/_email_exclusions.py`
   (`BLOCKED_EMAIL_DOMAINS`), imported by `build_trucking_campaigns.py` and
   `populate_new_carrier_startup_campaign.py`. New campaigns never include them.
2. **Postfix backstop:** `/etc/postfix/transport` maps every Yahoo-family domain
   to `hold:`. If any leak into the queue they are parked, never sent from a
   rotation IP.

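To illustrate the audience-build layer, here is a minimal Python sketch of a domain blocklist filter. It is not the contents of `scripts/_email_exclusions.py`: the set below is a small illustrative subset, the full domain names (including TLDs) are assumptions because this runbook only names the brands, and `is_blocked` and `filter_audience` are hypothetical helper names.

```python
# Illustrative subset only; the authoritative list lives in scripts/_email_exclusions.py.
# The TLDs are assumptions: the runbook names the brands, not the full domains.
BLOCKED_EMAIL_DOMAINS = frozenset({
    "yahoo.com", "aol.com", "att.net", "verizon.net", "sbcglobal.net",
    "bellsouth.net", "ymail.com", "rocketmail.com",
})


def is_blocked(email: str) -> bool:
    """True if the address's domain is on the blocklist (case-insensitive)."""
    _, sep, domain = email.strip().rpartition("@")
    if not sep:
        return False  # no "@" at all; leave address validation to the caller
    return domain.lower().rstrip(".") in BLOCKED_EMAIL_DOMAINS


def filter_audience(emails: list[str]) -> list[str]:
    """Keep only addresses whose domain is not blocked."""
    return [e for e in emails if not is_blocked(e)]
```

Matching on the exact, lowercased domain keeps the check cheap and predictable. Whether the real list also covers subdomains or regional variants is not stated here.
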
## Incident: May 30-31 2026 reputation collapse

A campaign blast pushed ~29k sends in a day across cold IPs `.91/.92/.93` with no
daily volume cap. Result:
- Gmail: `550-5.7.1 ... likely unsolicited mail` (hard spam block).
- Yahoo: `421 TSS04` on the rotation IPs.
- Steady state afterward: ~13% delivery (10k sent vs 68k deferred + 7k bounced
  in a day). Listmonk open rate ~4%, clicks ~0.

### Remediation (Jun 02 2026)
- **Retired the 3 burned IPs** (`.91/.92/.93` = out02/03/04) from rotation.
  Confirmed `.94-.109` had never sent outbound (only inbound port-scan noise),
  so they are pristine.
- **Swapped rotation to fresh `.94/.95/.96`** (out05/06/07) and reset the warmup
  start date to day 0.
- **Patched `pw-mta-warmup`** `ALL` array to start at `out05` so the daily cron
  never reverts to the burned IPs.
- **Rewrote `/etc/postfix/transport`** to `hold:` the full Yahoo family (was a
  partial list with buggy duplicate keys routing to `yahooslow`).
- **Flushed the entire stale queue** (1,846 blast-era messages, mostly dead
  satellite ISPs) so fresh IPs start clean.
- **Enabled Listmonk sliding-window rate limit** so no campaign can blast again:
  `app.message_sliding_window=true`, duration `1h`, rate `50`, `message_rate=2`.
- **Paused 19 trucking campaigns** (IDs 275-293, ~13k recipients) that were
  scheduled to fire Jun 03; they were built before the exclusion fix and would
  have re-torched the fresh IPs. Rebuild them small/clean before resending.

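For readers unfamiliar with the idea behind a sliding-window cap, the sketch below shows the general technique in Python, using the values from the remediation (50 messages per 1 hour) as defaults. It is a conceptual illustration only and makes no claim about how Listmonk implements its limiter. The class name `SlidingWindowLimiter` and its method are hypothetical.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `rate` sends within any trailing `window_seconds` span."""

    def __init__(self, rate: int = 50, window_seconds: float = 3600.0) -> None:
        self.rate = rate
        self.window_seconds = window_seconds
        self._sent: deque = deque()  # timestamps of sends still inside the window

    def allow(self, now: float) -> bool:
        """Record and permit a send at time `now`, or refuse if the window is full."""
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.rate:
            return False
        self._sent.append(now)
        return True
```

Unlike a fixed hourly counter that resets on the hour, a sliding window never lets two bursts land back to back across a reset boundary, which is the failure mode that matters when protecting cold IPs.
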
## Fresh-IP warmup discipline (the rules)

1. **Small audiences.** Day 0-3: a few hundred TOTAL per day, not per campaign.
   Lower the `limit` values in `build_trucking_campaigns.py` segment specs while
   warming.
2. **Best recipients first.** Only verified / engaged addresses. Gmail and
   Microsoft only (Yahoo family already excluded).
3. **Scrub hard bounces immediately.** `550 5.1.1` (no such user), full mailbox,
   "not our customer" all hurt reputation signals.
4. **Watch the signals daily** (see commands below). If Gmail `550-5.7.1` or
   Yahoo `421 TSS04` reappear, STOP and hold for several days.
5. **Ramp Listmonk's sliding window in step with the IP warmup** (e.g. 50/h ->
   150/h -> 300/h as days pass and signals stay clean). Restart the listmonk
   container after changing `settings`.

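Rule 4 can be partly automated by scanning log lines for the two stop signals. The following Python sketch is one possible way to do that, assuming the rejection text appears in the log line as quoted in this runbook. `HALT_PATTERNS` and `halt_signals` are hypothetical names, and the patterns are deliberately loose, so a hit should prompt a manual look at the log before acting on it.

```python
import re

# Patterns derived from the rejection strings quoted in this runbook.
HALT_PATTERNS = {
    "gmail_spam_block": re.compile(r"550-5\.7\.1"),
    "yahoo_tss04": re.compile(r"421 .*TSS04"),
}


def halt_signals(log_lines):
    """Return the sorted names of halt patterns found in the given log lines."""
    hits = set()
    for line in log_lines:
        for name, pattern in HALT_PATTERNS.items():
            if pattern.search(line):
                hits.add(name)
    return sorted(hits)
```

An empty result means neither pattern was seen in the lines supplied. It does not by itself mean reputation is healthy.
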
## Monitoring commands

```bash
# today's log-line prefix; matches both "Jun  2" (space-padded syslog day) and "Jun 02"
today="^$(date '+%b') [ 0]?$(date '+%-d') "

# delivery mix today
sudo grep -E "$today" /var/log/mail.log | grep -oE 'status=(sent|deferred|bounced)' | sort | uniq -c

# per-IP outbound volume today (catch a runaway blast early)
for ip in 94 95 96; do printf '.%s: ' "$ip"; sudo grep -E "$today" /var/log/mail.log | grep -cF "207.174.124.$ip"; done

# top deferral / bounce reasons today
sudo grep -E "$today" /var/log/mail.log | grep -E 'status=(deferred|bounced)' | grep -oE 'said: [0-9]{3}[^)]{0,50}' | sort | uniq -c | sort -rn | head

# queue size
sudo postqueue -p | tail -1

# active rotation pool + warmup day (assumes pw-warmup-start holds epoch seconds)
sudo postconf -h transport_maps
echo $(( ($(date +%s) - $(sudo cat /etc/postfix/pw-warmup-start)) / 86400 ))
```

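When the same delivery-mix numbers are needed in a script or a dashboard rather than at a shell prompt, the first command can be mirrored in Python. This is a minimal sketch that assumes the caller has already selected the log lines for the day of interest. `delivery_mix` and `delivery_rate` are hypothetical names.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"status=(sent|deferred|bounced)")


def delivery_mix(log_lines):
    """Count sent/deferred/bounced statuses across Postfix log lines."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def delivery_rate(counts):
    """Share of logged attempts that ended in status=sent, or 0.0 if none."""
    total = sum(counts.values())
    return counts["sent"] / total if total else 0.0
```

Postfix logs a status line per delivery attempt, so a message that is deferred and retried several times is counted once per attempt. The rate is therefore a per-attempt figure, not a per-message one.
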
## Backups left on the server (Jun 02 2026 remediation)
- `/etc/postfix/main.cf.bak.*`
- `/etc/postfix/transport.bak.*`
- `/usr/local/bin/pw-mta-warmup.bak.*`