From 4276adab8000e88d9be8aa4b9346a43dd608794b Mon Sep 17 00:00:00 2001
From: justin
Date: Thu, 25 Jun 2026 17:28:33 -0500
Subject: [PATCH] infra(mail): fix warmed sending IPs dropping off ens18 on reboot (Jun 24 outage)

Unattended kernel-upgrade reboot (Jun 24 04:04) left only .71 bound because
classic ifupdown applies just the first 'address' line. Postfix then failed
to bind .94/.107 ('Cannot assign requested address') and silently egressed
from .71 -- which is NOT in SPF (every fallback msg failed SPF) and is on
RLR621 + Trend ERS-QIL. ~37h of bypassed IP-warming + a near-zero sales day.

Fixes:
- /etc/network/interfaces: explicit up/down ip-addr hooks for .72/.94/.107
- pw-mail-ips.service: systemd oneshot re-binds IPs + flushes queue on boot
- pw-mail-ip-watchdog: */5 cron re-binds missing IPs + flushes, also catches
  'Cannot assign' bind failures
- runbook: full incident writeup + reboot-test lesson

Host already remediated live; this commits the host artifacts + docs.
---
 docs/email-deliverability-runbook.md | 41 ++++++++++++++++++++++++++++
 infra/mail/pw-mail-ip-watchdog       | 20 ++++++++++++++
 infra/mail/pw-mail-ips.service       | 13 +++++++++
 3 files changed, 74 insertions(+)
 create mode 100755 infra/mail/pw-mail-ip-watchdog
 create mode 100644 infra/mail/pw-mail-ips.service

diff --git a/docs/email-deliverability-runbook.md b/docs/email-deliverability-runbook.md
index c43ca29..41f7dbd 100644
--- a/docs/email-deliverability-runbook.md
+++ b/docs/email-deliverability-runbook.md
@@ -322,3 +322,44 @@ All discovered during the post-incident technical audit; each fix is codified.
 clicks** (the same junked-mail signature as the trucking blasts). Any future
 telecom UI campaign should set an altbody (Listmonk "Plain text" toggle) and run
 through the same dead-ISP/suppression hygiene. Commit `b375385`.
+
+## INCIDENT 2026-06-24: warmed sending IPs dropped off the interface after reboot
+
+**Impact:** ~37h of degraded deliverability + a near-zero sales day (Jun 24 04:04 -> Jun 25 17:25). Root cause was infrastructure, not reputation.
+
+**What happened.** An unattended kernel upgrade rebooted the host at Jun 24 04:04
+(6.12.90 -> 6.12.94). The warmed sending IPs `.94` (trucking/out05) and `.107`
+(HC/hcout1) are defined in `/etc/network/interfaces`, but **classic ifupdown
+(0.8.44) only applies the FIRST `address` line per stanza** -- so only `.71`
+(the primary) came back up. Postfix's `smtp_bind_address=.94/.107` then failed
+with `warning: smtp_connect_addr: bind ...: Cannot assign requested address` and
+**silently fell back to egressing from `.71`**. `.71` is (a) NOT in the SPF
+record (`v=spf1 ... ip4:.94 ip4:.107 -all`) so every fallback message **failed
+SPF**, and (b) listed on **RLR621** + **Trend Micro ERS-QIL**, so receivers
+deferred them (`451 ... blacklisted - RLR621 - ip=<207.174.124.71>`). Net: the
+IP warming was bypassed and mail either failed SPF or got reputation-deferred.
+
+**Detection.** Tail of `/var/log/mail.log` showed `Cannot assign requested
+address` (16,993 in one log) + deferrals citing `ip=<207.174.124.71>`.
+`ip -4 addr show ens18` showed only `.71` bound (missing `.72/.94/.107`).
+`last reboot` pinned the start to the 04:04 boot. Major RBLs (Spamhaus ZEN/DBL,
+Barracuda, SpamCop, SORBS) were still **clean** for `.94/.107` and the domain --
+RLR621/ERS-QIL are proprietary soft listings keyed off `.71`/HELO and age off.
+
+**Fix (all applied 2026-06-25 ~17:25 CDT).**
+1. Re-bound live: `ip addr add 207.174.124.{72,94,107}/23 dev ens18`, then `postqueue -f`.
+2. Reboot-persistence in `/etc/network/interfaces`: added explicit
+   `up/down ip addr add/del ...` hooks for the 3 secondaries (classic ifupdown
+   ignores 2nd+ `address` lines; the hooks are honored). Backup at
+   `/etc/network/interfaces.bak-*`.
+3. Belt-and-suspenders systemd oneshot `pw-mail-ips.service` (in repo at
+   `infra/mail/pw-mail-ips.service`) re-binds the IPs + flushes the queue on boot.
+4. Watchdog cron `*/5` `pw-mail-ip-watchdog` (repo `infra/mail/`) re-binds any
+   missing sending IP and flushes if it had to act or sees `Cannot assign` lines.
+
+**Lesson / TODO.** The host does unattended-upgrade reboots ~weekly (seen
+05-25, 05-30, 06-24, all ~04:04). Any IP/transport change must be reboot-tested.
+Consider migrating ifupdown -> netplan with all addresses, or pin
+`unattended-upgrades` to skip auto-reboot. The `mail_reputation_monitor.py`
+attributes egress to `.71` as "transactional default" -- after this incident, a
+spike of `.71` egress in the bulk streams is itself an alarm.
diff --git a/infra/mail/pw-mail-ip-watchdog b/infra/mail/pw-mail-ip-watchdog
new file mode 100755
index 0000000..383dac7
--- /dev/null
+++ b/infra/mail/pw-mail-ip-watchdog
@@ -0,0 +1,20 @@
+#!/bin/sh
+# Guard against the Jun 24 incident: an unattended reboot dropped the warmed
+# sending IPs (.94/.107) off ens18 because classic ifupdown only applies the
+# first "address" line. Postfix then fell back to egressing from .71 (NOT in
+# SPF, on RLR621/Trend ERS-QIL) for ~37h, tanking deliverability silently.
+# This re-binds any missing sending IP and logs/flushes if it had to act.
+CHANGED=0
+for ip in 207.174.124.72 207.174.124.94 207.174.124.107; do
+  if ! ip addr show ens18 | grep -q "$ip/"; then
+    ip addr add "$ip/23" dev ens18 && CHANGED=1
+    logger -t pw-mail-ip-watchdog "re-bound missing sending IP $ip to ens18"
+  fi
+done
+# Also catch silent bind failures even if the IP looks present.
+if tail -n 500 /var/log/mail.log 2>/dev/null | grep -q "Cannot assign requested address"; then
+  logger -t pw-mail-ip-watchdog "postfix bind failures detected in recent mail.log"
+  CHANGED=1
+fi
+[ "$CHANGED" = 1 ] && /usr/sbin/postqueue -f 2>/dev/null
+exit 0
diff --git a/infra/mail/pw-mail-ips.service b/infra/mail/pw-mail-ips.service
new file mode 100644
index 0000000..5191083
--- /dev/null
+++ b/infra/mail/pw-mail-ips.service
@@ -0,0 +1,13 @@
+[Unit]
+Description=Ensure Performance West mail sending IPs are bound to ens18
+After=network-online.target networking.service
+Wants=network-online.target
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/bin/sh -c "for ip in 207.174.124.72 207.174.124.94 207.174.124.107; do ip addr show ens18 | grep -q \"$ip/\" || ip addr add $ip/23 dev ens18; done"
+ExecStart=/usr/sbin/postqueue -f
+
+[Install]
+WantedBy=multi-user.target
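
Review note on the runbook's "must be reboot-tested" lesson: a read-only check run after a test reboot could confirm that every secondary came back. The following is a minimal sketch under assumptions, not something this commit ships. `missing_ips` and `EXPECTED_IPS` are hypothetical names; the three addresses are the ones in the watchdog above. The helper reads `ip -o -4 addr show dev ens18` output on stdin and prints each expected sending IP that is not bound. It never re-binds anything or flushes the queue.

```shell
#!/bin/sh
# Hypothetical reboot-test helper (not part of this patch).
# Addresses copied from pw-mail-ip-watchdog in this commit.
EXPECTED_IPS="207.174.124.72 207.174.124.94 207.174.124.107"

# Reads `ip -o -4 addr show dev ens18` output on stdin and prints,
# one per line, every expected IP that does not appear in it.
missing_ips() {
    bound=$(cat)    # capture stdin once so it can be searched per IP
    for ip in $EXPECTED_IPS; do
        # -F matches a fixed string, so the dots are literal. The leading
        # space and trailing "/" delimit the address on both sides.
        printf '%s\n' "$bound" | grep -qF " $ip/" || printf '%s\n' "$ip"
    done
}
```

On the mail host this would be run as `ip -o -4 addr show dev ens18 | missing_ips` (assuming iproute2 is installed); empty output would mean all three secondaries survived the reboot. Fixed-string matching is a deliberate difference from the watchdog's `grep -q "$ip/"`, where each dot is a regex wildcard.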