infra(mail): fix warmed sending IPs dropping off ens18 on reboot (Jun 24 outage)
Unattended kernel-upgrade reboot (Jun 24 04:04) left only .71 bound because
classic ifupdown applies just the first 'address' line. Postfix then failed to
bind .94/.107 ('Cannot assign requested address') and silently egressed from
.71 -- which is NOT in SPF (every fallback msg failed SPF) and is on RLR621 +
Trend ERS-QIL. ~37h of bypassed IP-warming + a near-zero sales day.
Fixes:
- /etc/network/interfaces: explicit up/down ip-addr hooks for .72/.94/.107
- pw-mail-ips.service: systemd oneshot re-binds IPs + flushes queue on boot
- pw-mail-ip-watchdog: */5 cron re-binds missing IPs + flushes, also catches
'Cannot assign' bind failures
- runbook: full incident writeup + reboot-test lesson
Host already remediated live; this commits the host artifacts + docs.
parent 7ad4c920c6
commit 4276adab80
3 changed files with 74 additions and 0 deletions
@@ -322,3 +322,44 @@ All discovered during the post-incident technical audit; each fix is codified.
clicks** (the same junked-mail signature as the trucking blasts). Any future
telecom UI campaign should set an altbody (Listmonk "Plain text" toggle) and
run through the same dead-ISP/suppression hygiene. Commit `b375385`.

## INCIDENT 2026-06-24: warmed sending IPs dropped off the interface after reboot

**Impact:** ~37h of degraded deliverability + a near-zero sales day (Jun 24 04:04 -> Jun 25 17:25). Root cause was infrastructure, not reputation.

**What happened.** An unattended kernel upgrade rebooted the host at Jun 24 04:04
(6.12.90 -> 6.12.94). The warmed sending IPs `.94` (trucking/out05) and `.107`
(HC/hcout1) are defined in `/etc/network/interfaces`, but **classic ifupdown
(0.8.44) only applies the FIRST `address` line per stanza** -- so only `.71`
(the primary) came back up. Postfix's `smtp_bind_address=.94/.107` then failed
with `warning: smtp_connect_addr: bind ...: Cannot assign requested address` and
**silently fell back to egressing from `.71`**. `.71` is (a) NOT in the SPF
record (`v=spf1 ... ip4:.94 ip4:.107 -all`) so every fallback message **failed
SPF**, and (b) listed on **RLR621** + **Trend Micro ERS-QIL**, so receivers
deferred them (`451 ... blacklisted - RLR621 - ip=<207.174.124.71>`). Net: the
IP warming was bypassed and mail either failed SPF or got reputation-deferred.

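To make the SPF half of the failure concrete, here is a minimal Python sketch that checks whether an egress IP is covered by a record's `ip4:` mechanisms. The record string is a stand-in assembled from the addresses named in this writeup (the real record has other terms, elided above), and both function names are hypothetical. It looks only at `ip4:` terms and ignores `include:`, `a`, `mx` and the rest, so it is not a full SPF evaluator.

```python
import ipaddress

# Stand-in record: only the two warmed IPs, per the writeup. The real record
# contains additional terms that are elided in the runbook.
ASSUMED_SPF = "v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all"


def spf_ip4_networks(record: str) -> list:
    """Return the networks named by pass-qualified ip4: mechanisms."""
    networks = []
    for term in record.split():
        # An explicit "+" qualifier means pass, same as no qualifier.
        mechanism = term[1:] if term.startswith("+") else term
        if mechanism.lower().startswith("ip4:"):
            # A bare address becomes a /32 network.
            networks.append(ipaddress.ip_network(mechanism[4:], strict=False))
    return networks


def covered_by_ip4(record: str, source_ip: str) -> bool:
    """True if source_ip falls inside any ip4: mechanism of the record."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in spf_ip4_networks(record))


for ip in ("207.174.124.94", "207.174.124.107", "207.174.124.71"):
    print(ip, "covered" if covered_by_ip4(ASSUMED_SPF, ip) else "NOT covered")
```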

**Detection.** Tail of `/var/log/mail.log` showed `Cannot assign requested
address` (16,993 in one log) + deferrals citing `ip=<207.174.124.71>`.
`ip -4 addr show ens18` showed only `.71` bound (missing `.72/.94/.107`).
`last reboot` pinned the start to the 04:04 boot. Major RBLs (Spamhaus ZEN/DBL,
Barracuda, SpamCop, SORBS) were still **clean** for `.94/.107` and the domain --
RLR621/ERS-QIL are proprietary soft listings keyed off `.71`/HELO and age off.

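The log triage above can be approximated with a short Python sketch that counts bind failures and tallies the IPs cited in deferrals. This is an illustration, not the tooling used during the incident. The function name is hypothetical, and the two sample lines are just the fragments quoted in this writeup (elisions included), not full Postfix log lines.

```python
import re
from collections import Counter

BIND_FAILURE = "Cannot assign requested address"
# Matches the ip=<a.b.c.d> token receivers put in their deferral text.
DEFERRAL_IP = re.compile(r"ip=<(\d{1,3}(?:\.\d{1,3}){3})>")


def summarize_mail_log(lines):
    """Return (bind_failure_count, Counter of IPs cited in deferrals)."""
    bind_failures = 0
    cited_ips = Counter()
    for line in lines:
        if BIND_FAILURE in line:
            bind_failures += 1
        for ip in DEFERRAL_IP.findall(line):
            cited_ips[ip] += 1
    return bind_failures, cited_ips


# Fragments as quoted in the runbook; real log lines carry more fields.
sample = [
    "warning: smtp_connect_addr: bind ...: Cannot assign requested address",
    "451 ... blacklisted - RLR621 - ip=<207.174.124.71>",
]
failures, cited = summarize_mail_log(sample)
print(failures, dict(cited))

# On the host one would pass the open log file instead, e.g.:
#   with open("/var/log/mail.log", errors="replace") as fh:
#       print(summarize_mail_log(fh))
```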

**Fix (all applied 2026-06-25 ~17:25 CDT).**
1. Re-bound live: `ip addr add 207.174.124.{72,94,107}/23 dev ens18` (shorthand;
   one `ip addr add` per address), then `postqueue -f`.
2. Reboot-persistence in `/etc/network/interfaces`: added explicit
   `up/down ip addr add/del ...` hooks for the 3 secondaries (classic ifupdown
   ignores 2nd+ `address` lines; the hooks are honored). Backup at
   `/etc/network/interfaces.bak-*`.
3. Belt-and-suspenders systemd oneshot `pw-mail-ips.service` (in repo at
   `infra/mail/pw-mail-ips.service`) re-binds the IPs + flushes the queue on boot.
4. Watchdog cron `*/5` `pw-mail-ip-watchdog` (repo `infra/mail/`) re-binds any
   missing sending IP and flushes if it had to act or sees `Cannot assign` lines.

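The re-bind logic shared by steps 1, 3 and 4 can be sketched in Python as below. This is not the `pw-mail-ip-watchdog` script from the repo. The function names are hypothetical, the interface, prefix length and addresses come from this writeup, and the sketch leaves out the watchdog's extra check for `Cannot assign` log lines. Parsing is kept separate from execution so the decision logic can be tested without touching the network stack.

```python
import subprocess

INTERFACE = "ens18"
PREFIX_LEN = 23
EXPECTED = ("207.174.124.72", "207.174.124.94", "207.174.124.107")


def bound_ipv4(ip_output: str) -> set:
    """Parse `ip -o -4 addr show dev <iface>` output into a set of addresses."""
    bound = set()
    for line in ip_output.splitlines():
        fields = line.split()
        # One-line format: "<idx>: <iface> inet <addr>/<prefix> ..."
        if "inet" in fields:
            cidr = fields[fields.index("inet") + 1]
            bound.add(cidr.split("/")[0])
    return bound


def rebind_commands(ip_output: str, expected=EXPECTED) -> list:
    """Build one `ip addr add` argv per expected address that is not bound."""
    present = bound_ipv4(ip_output)
    return [
        ["ip", "addr", "add", f"{ip}/{PREFIX_LEN}", "dev", INTERFACE]
        for ip in expected
        if ip not in present
    ]


def run_once() -> list:
    """Re-bind missing IPs, then flush the queue only if something changed.

    Needs root; intended to be called from cron or a oneshot unit.
    """
    shown = subprocess.run(
        ["ip", "-o", "-4", "addr", "show", "dev", INTERFACE],
        capture_output=True, text=True, check=True,
    ).stdout
    commands = rebind_commands(shown)
    for argv in commands:
        subprocess.run(argv, check=True)
    if commands:
        subprocess.run(["postqueue", "-f"], check=True)
    return commands
```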

**Lesson / TODO.** The host does unattended-upgrade reboots ~weekly (seen
05-25, 05-30, 06-24, all ~04:04). Any IP/transport change must be reboot-tested.
Consider migrating ifupdown -> netplan with all addresses, or configuring
`unattended-upgrades` not to auto-reboot. `mail_reputation_monitor.py`
attributes egress to `.71` as "transactional default" -- after this incident, a
spike of `.71` egress in the bulk streams is itself an alarm.
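
One possible shape for that alarm, as a Python sketch only: given the source IPs observed for one bulk stream's deliveries, flag the stream when too large a share left from the primary address. The function names and the 5% threshold are assumptions for illustration, and nothing here reflects how `mail_reputation_monitor.py` is actually structured.

```python
from collections import Counter

DEFAULT_IP = "207.174.124.71"  # primary address, not a warmed sending IP


def default_egress_share(source_ips, default_ip=DEFAULT_IP) -> float:
    """Fraction of deliveries in one stream that left from default_ip."""
    counts = Counter(source_ips)
    total = sum(counts.values())
    return counts[default_ip] / total if total else 0.0


def bulk_stream_alarm(source_ips, threshold=0.05) -> bool:
    """True when the default-IP share exceeds the (assumed) threshold."""
    return default_egress_share(source_ips) > threshold


# Hypothetical observations for one bulk stream.
observed = ["207.174.124.94", "207.174.124.94", "207.174.124.71", "207.174.124.94"]
print(default_egress_share(observed), bulk_stream_alarm(observed))
```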