infra(mail): fix warmed sending IPs dropping off ens18 on reboot (Jun 24 outage)

Unattended kernel-upgrade reboot (Jun 24 04:04) left only .71 bound because
classic ifupdown applies just the first 'address' line. Postfix then failed to
bind .94/.107 ('Cannot assign requested address') and silently egressed from
.71 -- which is NOT in SPF (every fallback msg failed SPF) and is on RLR621 +
Trend ERS-QIL. ~37h of bypassed IP-warming + a near-zero sales day.

Fixes:
- /etc/network/interfaces: explicit up/down ip-addr hooks for .72/.94/.107
- pw-mail-ips.service: systemd oneshot re-binds IPs + flushes queue on boot
- pw-mail-ip-watchdog: */5 cron re-binds missing IPs + flushes, also catches
  'Cannot assign' bind failures
- runbook: full incident writeup + reboot-test lesson

Host already remediated live; this commits the host artifacts + docs.
justin 2026-06-25 17:28:33 -05:00
parent 7ad4c920c6
commit 4276adab80
3 changed files with 74 additions and 0 deletions


@ -322,3 +322,44 @@ All discovered during the post-incident technical audit; each fix is codified.
clicks** (the same junked-mail signature as the trucking blasts). Any future
telecom UI campaign should set an altbody (Listmonk "Plain text" toggle) and
run through the same dead-ISP/suppression hygiene. Commit `b375385`.
## INCIDENT 2026-06-24: warmed sending IPs dropped off the interface after reboot
**Impact:** ~37h of degraded deliverability + a near-zero sales day (Jun 24 04:04 -> Jun 25 17:25). Root cause was infrastructure, not reputation.
**What happened.** An unattended kernel upgrade rebooted the host at Jun 24 04:04
(6.12.90 -> 6.12.94). The warmed sending IPs `.94` (trucking/out05) and `.107`
(HC/hcout1) are defined in `/etc/network/interfaces`, but **classic ifupdown
(0.8.44) only applies the FIRST `address` line per stanza** -- so only `.71`
(the primary) came back up. Postfix's `smtp_bind_address=.94/.107` then failed
with `warning: smtp_connect_addr: bind ...: Cannot assign requested address` and
**silently fell back to egressing from `.71`**. `.71` is (a) NOT in the SPF
record (`v=spf1 ... ip4:.94 ip4:.107 -all`) so every fallback message **failed
SPF**, and (b) listed on **RLR621** + **Trend Micro ERS-QIL**, so receivers
deferred them (`451 ... blacklisted - RLR621 - ip=<207.174.124.71>`). Net: the
IP warming was bypassed and mail either failed SPF or got reputation-deferred.
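
**Sketch (SPF membership check).** The SPF half of the failure can be checked mechanically: does the record list the egress address as a literal `ip4:` term? The helper below is a minimal sketch, not the tooling used during the incident. `spf_lists_ip` is a hypothetical name, and it only recognises exact `ip4:<addr>` terms (optionally with a `+` qualifier or `/32` suffix); CIDR ranges, `include:`, `a` and `mx` are not evaluated.

```sh
#!/bin/sh
# Sketch: succeed if the SPF record text lists ADDR as a literal ip4: term.
# Assumption: only exact ip4:<addr> terms are recognised; CIDR ranges and
# include:/a/mx mechanisms are NOT evaluated.
spf_lists_ip() {
    record=$1
    addr=$2
    # Unquoted on purpose: split the record into whitespace-separated terms.
    for term in $record; do
        case $term in
            "ip4:$addr"|"ip4:$addr/32"|"+ip4:$addr"|"+ip4:$addr/32")
                return 0 ;;
        esac
    done
    return 1
}
```

With an illustrative record holding only the two known mechanisms, `spf_lists_ip 'v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all' 207.174.124.71` returns non-zero, which matches the SPF failures seen on fallback mail.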
**Detection.** Tail of `/var/log/mail.log` showed `Cannot assign requested
address` (16,993 in one log) + deferrals citing `ip=<207.174.124.71>`.
`ip -4 addr show ens18` showed only `.71` bound (missing `.72/.94/.107`).
`last reboot` pinned the start to the 04:04 boot. Major RBLs (Spamhaus ZEN/DBL,
Barracuda, SpamCop, SORBS) were still **clean** for `.94/.107` and the domain --
RLR621/ERS-QIL are proprietary soft listings keyed off `.71`/HELO, and they age off over time.
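
**Sketch (which addresses are missing).** The `ip -4 addr show ens18` check above can be turned into a filter that names the addresses that failed to come up. This is a hedged sketch: `missing_ips` and `EXPECTED_IPS` are hypothetical names, and the function reads `ip -4 -o addr show dev ens18` output on stdin so it can be exercised without touching the interface.

```sh
#!/bin/sh
# Addresses expected on ens18, taken from the incident notes.
EXPECTED_IPS="207.174.124.71 207.174.124.72 207.174.124.94 207.174.124.107"

# Reads `ip -4 -o addr show dev ens18` output on stdin and prints every
# expected address that is not bound, one per line.
missing_ips() {
    bound=$(cat)
    for ip in $EXPECTED_IPS; do
        # -F treats dots literally; the leading space and trailing slash stop
        # a shorter address from matching as a substring of a longer one.
        printf '%s\n' "$bound" | grep -qF " $ip/" || printf '%s\n' "$ip"
    done
}
```

Usage on the host would be `ip -4 -o addr show dev ens18 | missing_ips`; empty output means every expected address is bound.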
**Fix (all applied 2026-06-25 ~17:25 CDT).**
1. Re-bound live: `for n in 72 94 107; do ip addr add 207.174.124.$n/23 dev ens18; done` (one `ip addr add` per address; the command takes a single address), then `postqueue -f`.
2. Reboot-persistence in `/etc/network/interfaces`: added explicit
`up/down ip addr add/del ...` hooks for the 3 secondaries (classic ifupdown
ignores 2nd+ `address` lines; the hooks are honored). Backup at
`/etc/network/interfaces.bak-*`.
3. Belt-and-suspenders systemd oneshot `pw-mail-ips.service` (in repo at
`infra/mail/pw-mail-ips.service`) re-binds the IPs + flushes the queue on boot.
4. Watchdog cron `*/5` `pw-mail-ip-watchdog` (repo `infra/mail/`) re-binds any
missing sending IP and flushes if it had to act or sees `Cannot assign` lines.
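
**Sketch (hook lines for step 2).** The host's actual `/etc/network/interfaces` is not part of this commit, so the exact hook lines are not shown here. The generator below prints one plausible form of the `up`/`down` lines for the three secondaries, assuming they go inside the existing `iface ens18` stanza. `print_hooks` is a hypothetical helper, and the `|| true` suffix is an assumption that keeps `ifup`/`ifdown` from aborting when an address is already present or already gone.

```sh
#!/bin/sh
# Prints ifupdown "up"/"down" hook lines for the given secondary addresses.
# Assumption: the output is pasted inside the existing "iface ens18" stanza.
print_hooks() {
    dev=$1
    shift
    for ip in "$@"; do
        printf '    up   ip addr add %s/23 dev %s || true\n' "$ip" "$dev"
    done
    for ip in "$@"; do
        printf '    down ip addr del %s/23 dev %s || true\n' "$ip" "$dev"
    done
}

print_hooks ens18 207.174.124.72 207.174.124.94 207.174.124.107
```

Generating the lines from one address list keeps the `up` and `down` sets in sync, which matters because a missing `down` line leaves a stale address behind on `ifdown`.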
**Lesson / TODO.** The host does unattended-upgrade reboots ~weekly (seen
05-25, 05-30, 06-24, all ~04:04), so any IP/transport change must be
reboot-tested before it counts as done. Consider migrating ifupdown -> netplan
with all addresses declared, or disabling auto-reboot in `unattended-upgrades`
(`Unattended-Upgrade::Automatic-Reboot "false";`). `mail_reputation_monitor.py`
attributes egress to `.71` as "transactional default"; after this incident, a
spike of `.71` egress in the bulk streams is itself an alarm.
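
**Sketch (fallback alarm).** One way to alarm on fallback egress is to count the two log signatures from this incident. This is a sketch only and is not part of `mail_reputation_monitor.py`: `fallback_alarm` and its threshold argument are hypothetical, and the match strings are the two fragments quoted in the writeup above.

```sh
#!/bin/sh
# Counts two fallback signatures from the incident in a Postfix log:
#   - "Cannot assign requested address" (bind failure on a sending IP)
#   - "ip=<207.174.124.71>"             (a receiver citing the primary IP)
# Returns 0 when the combined count is within THRESHOLD, 1 when it exceeds
# it, and 2 when the log is unreadable. THRESHOLD defaults to 0.
fallback_alarm() {
    log=$1
    threshold=${2:-0}
    [ -r "$log" ] || return 2
    # grep -c exits 1 on a zero count; "|| true" keeps `set -e` callers alive.
    binds=$(grep -cF 'Cannot assign requested address' "$log" || true)
    cited=$(grep -cF 'ip=<207.174.124.71>' "$log" || true)
    echo "bind_failures=$binds primary_ip_cited=$cited"
    [ $((binds + cited)) -le "$threshold" ]
}
```

A cron job could call `fallback_alarm /var/log/mail.log 0 || logger -t pw-mail "fallback egress detected"`; a sensible threshold depends on how much legitimate transactional mail cites `.71`, which this sketch does not try to separate out.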

infra/mail/pw-mail-ip-watchdog Executable file

@ -0,0 +1,20 @@
#!/bin/sh
# Guard against the Jun 24 incident: an unattended reboot dropped the warmed
# sending IPs (.94/.107) off ens18 because classic ifupdown only applies the
# first "address" line. Postfix then fell back to egressing from .71 (NOT in
# SPF, on RLR621/Trend ERS-QIL) for ~37h, tanking deliverability silently.
# This re-binds any missing sending IP and logs/flushes if it had to act.
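CHANGED=0
for ip in 207.174.124.72 207.174.124.94 207.174.124.107; do
    # -F matches the address literally (dots are not regex wildcards); the
    # leading space and trailing slash prevent partial-address matches.
    if ! ip -4 addr show dev ens18 | grep -qF " $ip/"; then
        # Log success only when the re-bind actually worked.
        if ip addr add "$ip/23" dev ens18; then
            CHANGED=1
            logger -t pw-mail-ip-watchdog "re-bound missing sending IP $ip to ens18"
        else
            logger -t pw-mail-ip-watchdog "FAILED to re-bind sending IP $ip to ens18"
        fi
    fi
done
# Also catch silent bind failures even if the IP looks present.
# Note: old matches stay within the last 500 lines until newer log traffic
# pushes them out, so this can trigger a flush on several consecutive runs.
if tail -n 500 /var/log/mail.log 2>/dev/null | grep -qF "Cannot assign requested address"; then
    logger -t pw-mail-ip-watchdog "postfix bind failures detected in recent mail.log"
    CHANGED=1
fi
# Flush the queue only if something was re-bound or bind failures were seen.
if [ "$CHANGED" = 1 ]; then
    /usr/sbin/postqueue -f 2>/dev/null
fi
exit 0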

infra/mail/pw-mail-ips.service

@ -0,0 +1,13 @@
[Unit]
Description=Ensure Performance West mail sending IPs are bound to ens18
After=network-online.target networking.service
Wants=network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
# "$$" is systemd's escape for a literal "$", so the shell receives $ip.
ExecStart=/bin/sh -c 'for ip in 207.174.124.72 207.174.124.94 207.174.124.107; do ip -4 addr show dev ens18 | grep -qF " $$ip/" || ip addr add "$$ip/23" dev ens18; done'
ExecStart=/usr/sbin/postqueue -f
[Install]
WantedBy=multi-user.target