diff --git a/docs/email-deliverability-runbook.md b/docs/email-deliverability-runbook.md
index c43ca29..41f7dbd 100644
--- a/docs/email-deliverability-runbook.md
+++ b/docs/email-deliverability-runbook.md
@@ -322,3 +322,44 @@ All discovered during the post-incident technical audit; each fix is codified.
 clicks** (the same junked-mail signature as the trucking blasts). Any future
 telecom UI campaign should set an altbody (Listmonk "Plain text" toggle) and run
 through the same dead-ISP/suppression hygiene. Commit `b375385`.
+
+## INCIDENT 2026-06-24: warmed sending IPs dropped off the interface after reboot
+
+**Impact:** ~37h of degraded deliverability + a near-zero sales day (Jun 24 04:04 -> Jun 25 17:25). Root cause was infrastructure, not reputation.
+
+**What happened.** An unattended kernel upgrade rebooted the host at Jun 24 04:04
+(6.12.90 -> 6.12.94). The warmed sending IPs `.94` (trucking/out05) and `.107`
+(HC/hcout1) are defined in `/etc/network/interfaces`, but **classic ifupdown
+(0.8.44) only applies the FIRST `address` line per stanza** -- so only `.71`
+(the primary) came back up. Postfix's `smtp_bind_address=.94/.107` then failed
+with `warning: smtp_connect_addr: bind ...: Cannot assign requested address` and
+**silently fell back to egressing from `.71`**. `.71` is (a) NOT in the SPF
+record (`v=spf1 ... ip4:.94 ip4:.107 -all`) so every fallback message **failed
+SPF**, and (b) listed on **RLR621** + **Trend Micro ERS-QIL**, so receivers
+deferred them (`451 ... blacklisted - RLR621 - ip=<207.174.124.71>`). Net: the
+IP warming was bypassed and mail either failed SPF or got reputation-deferred.
+
+**Detection.** Tail of `/var/log/mail.log` showed `Cannot assign requested
+address` (16,993 in one log) + deferrals citing `ip=<207.174.124.71>`.
+`ip -4 addr show ens18` showed only `.71` bound (missing `.72/.94/.107`).
+`last reboot` pinned the start to the 04:04 boot. Major RBLs (Spamhaus ZEN/DBL,
+Barracuda, SpamCop, SORBS) were still **clean** for `.94/.107` and the domain --
+RLR621/ERS-QIL are proprietary soft listings keyed off `.71`/HELO and age off.
+
+**Fix (all applied 2026-06-25 ~17:25 CDT).**
+1. Re-bound live: `ip addr add 207.174.124.{72,94,107}/23 dev ens18`, then `postqueue -f`.
+2. Reboot-persistence in `/etc/network/interfaces`: added explicit
+   `up/down ip addr add/del ...` hooks for the 3 secondaries (classic ifupdown
+   ignores 2nd+ `address` lines; the hooks are honored). Backup at
+   `/etc/network/interfaces.bak-*`.
+3. Belt-and-suspenders systemd oneshot `pw-mail-ips.service` (in repo at
+   `infra/mail/pw-mail-ips.service`) re-binds the IPs + flushes the queue on boot.
+4. Watchdog cron `*/5` `pw-mail-ip-watchdog` (repo `infra/mail/`) re-binds any
+   missing sending IP and flushes if it had to act or sees `Cannot assign` lines.
+
+**Lesson / TODO.** The host does unattended-upgrade reboots ~weekly (seen
+05-25, 05-30, 06-24, all ~04:04). Any IP/transport change must be reboot-tested.
+Consider migrating ifupdown -> netplan with all addresses, or pin
+`unattended-upgrades` to skip auto-reboot. The `mail_reputation_monitor.py`
+attributes egress to `.71` as "transactional default" -- after this incident, a
+spike of `.71` egress in the bulk streams is itself an alarm.
diff --git a/infra/mail/pw-mail-ip-watchdog b/infra/mail/pw-mail-ip-watchdog
new file mode 100755
index 0000000..383dac7
--- /dev/null
+++ b/infra/mail/pw-mail-ip-watchdog
@@ -0,0 +1,20 @@
+#!/bin/sh
+# Guard against the Jun 24 incident: an unattended reboot dropped the warmed
+# sending IPs (.94/.107) off ens18 because classic ifupdown only applies the
+# first "address" line. Postfix then fell back to egressing from .71 (NOT in
+# SPF, on RLR621/Trend ERS-QIL) for ~37h, tanking deliverability silently.
+# This re-binds any missing sending IP and logs/flushes if it had to act.
+CHANGED=0
+for ip in 207.174.124.72 207.174.124.94 207.174.124.107; do
+  if ! ip addr show ens18 | grep -q "$ip/"; then
+    ip addr add "$ip/23" dev ens18 && CHANGED=1
+    logger -t pw-mail-ip-watchdog "re-bound missing sending IP $ip to ens18"
+  fi
+done
+# Also catch silent bind failures even if the IP looks present.
+if tail -n 500 /var/log/mail.log 2>/dev/null | grep -q "Cannot assign requested address"; then
+  logger -t pw-mail-ip-watchdog "postfix bind failures detected in recent mail.log"
+  CHANGED=1
+fi
+[ "$CHANGED" = 1 ] && /usr/sbin/postqueue -f 2>/dev/null
+exit 0
diff --git a/infra/mail/pw-mail-ips.service b/infra/mail/pw-mail-ips.service
new file mode 100644
index 0000000..5191083
--- /dev/null
+++ b/infra/mail/pw-mail-ips.service
@@ -0,0 +1,13 @@
+[Unit]
+Description=Ensure Performance West mail sending IPs are bound to ens18
+After=network-online.target networking.service
+Wants=network-online.target
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/bin/sh -c "for ip in 207.174.124.72 207.174.124.94 207.174.124.107; do ip addr show ens18 | grep -q \"$ip/\" || ip addr add $ip/23 dev ens18; done"
+ExecStart=/usr/sbin/postqueue -f
+
+[Install]
+WantedBy=multi-user.target
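
Reviewer note: the runbook's lesson that any IP/transport change must be reboot-tested could be backed by a small read-only check along the following lines. This is only a sketch and is not part of the commit: `missing_ips` and `EXPECTED_IPS` are hypothetical names, the address list is copied from the watchdog script in this diff, and the function assumes it is fed the usual `inet <addr>/<prefix>` lines that `ip -4 addr show` prints.

```shell
#!/bin/sh
# Hypothetical post-reboot check (NOT part of this commit).
# Reads `ip -4 addr show <iface>` output on stdin and prints, on one line,
# every expected sending IP that is not bound. It changes nothing on the host.

# Assumption: same secondaries the watchdog re-binds.
EXPECTED_IPS="207.174.124.72 207.174.124.94 207.174.124.107"

missing_ips() {
  addr_output=$(cat)          # capture stdin once so it can be searched per IP
  missing=""
  for ip in $EXPECTED_IPS; do
    # -F treats the dots literally; the "inet " prefix and trailing "/" keep
    # one address from matching as a substring of another.
    if ! printf '%s\n' "$addr_output" | grep -qF "inet $ip/"; then
      missing="$missing $ip"
    fi
  done
  # Strip the leading separator; prints an empty line when nothing is missing.
  printf '%s\n' "${missing# }"
}
```

After a test reboot, something like `ip -4 addr show ens18 | missing_ips` should print an empty line; any address it prints is one Postfix cannot bind and would therefore fall back from. Taking the `ip` output on stdin keeps the check separate from the re-bind logic, so it can be run by hand or from monitoring without touching the interface.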