An unattended kernel-upgrade reboot (Jun 24 04:04) left only .71 bound, because
classic ifupdown applies only the first 'address' line. Postfix then failed to
bind .94/.107 ('Cannot assign requested address') and silently egressed from
.71 -- which is NOT in SPF (every fallback msg failed SPF) and is on RLR621 +
Trend ERS-QIL. Result: ~37h of bypassed IP warming and a near-zero sales day.

Fixes:
- /etc/network/interfaces: explicit up/down ip-addr hooks for .72/.94/.107
- pw-mail-ips.service: systemd oneshot re-binds IPs + flushes queue on boot
- pw-mail-ip-watchdog: */5 cron re-binds missing IPs + flushes, also catches
  'Cannot assign' bind failures
- runbook: full incident writeup + reboot-test lesson

Host already remediated live; this commits the host artifacts + docs.
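
Sketch of the missing-IP check a watchdog like this needs (illustrative only,
not the committed pw-mail-ip-watchdog script). It assumes the check compares an
expected list against 'ip -o -4 addr show' output. The 192.0.2.x addresses and
all function names are hypothetical placeholders, not the host's real IPs.

```python
import re

# Hypothetical placeholder addresses from the RFC 5737 documentation range.
EXPECTED_IPS = ["192.0.2.71", "192.0.2.94", "192.0.2.107"]

# Matches the "inet A.B.C.D/prefix" field of `ip -o -4 addr show` output.
_INET_RE = re.compile(r"\binet (\d{1,3}(?:\.\d{1,3}){3})/")


def bound_ipv4(ip_output):
    """Return the set of IPv4 addresses found in `ip -o -4 addr show` text."""
    return set(_INET_RE.findall(ip_output))


def missing_ips(expected, ip_output):
    """Return the expected addresses that are not currently bound, in order."""
    bound = bound_ipv4(ip_output)
    return [ip for ip in expected if ip not in bound]


# Sample output resembling the post-reboot state: only the first address is up.
SAMPLE_AFTER_REBOOT = (
    "1: lo    inet 127.0.0.1/8 scope host lo\\       valid_lft forever\n"
    "2: eth0    inet 192.0.2.71/24 brd 192.0.2.255 scope global eth0\\"
    "       valid_lft forever\n"
)

if __name__ == "__main__":
    print(missing_ips(EXPECTED_IPS, SAMPLE_AFTER_REBOOT))
```

Whatever missing_ips() returns is the set that would need re-binding before
the queue is flushed.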

Listmonk @TrackLink registers ONE static URL per tracked link and points
every recipient's /link/<uuid> redirect at it. On per-subscriber hrefs
({{ lp_link }}, ?dot=, ?npi=, ?clia=) this is doubly broken:
- the registered links.url was captured before the {{ lp_link }} token
  rendered, yielding /order/slug&utm_source=... (first &, no ?) -> 404
- even when valid it collapses every carrier/provider onto the first
  subscriber's dot/npi/clia value

Real human clicks are already tracked via Umami campaign-click (bot-filtered),
so Listmonk link tracking here is redundant and destructive.

Stripped @TrackLink from per-subscriber CTAs:
- scripts/create_deficiency_source_campaigns.py (_cta, _dot_check_cta)
- data/trucking_campaigns/{ucr,ifta}_*.html
- data/hc_campaigns/*.html (10 templates)
Static CTAs (e.g. CRTC ?code= order link) keep @TrackLink (safe).
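
Sketch of the stripping rule, for illustration. It assumes the templates use
the '@TrackLink' suffix inside href values; the regexes, the function name and
the example.com URLs are hypothetical, and the actual edits to the files above
may have been made differently.

```python
import re

# Markers this commit treats as per-subscriber: the {{ lp_link }} token and
# the dot/npi/clia query parameters.
_PER_SUBSCRIBER = re.compile(r"\{\{\s*lp_link\s*\}\}|[?&](?:dot|npi|clia)=")

# An href whose value ends in the @TrackLink suffix (assumed template form).
_TRACKED_HREF = re.compile(r'href="([^"]*)@TrackLink"')


def strip_tracklink(html):
    """Drop @TrackLink from per-subscriber hrefs; leave static ones tracked."""
    def _rewrite(match):
        url = match.group(1)
        if _PER_SUBSCRIBER.search(url):
            return 'href="%s"' % url
        return match.group(0)  # static link: keep tracking as-is
    return _TRACKED_HREF.sub(_rewrite, html)


if __name__ == "__main__":
    sample = (
        '<a href="{{ lp_link }}@TrackLink">Order</a>\n'
        '<a href="https://example.com/check?dot=123@TrackLink">Check</a>\n'
        '<a href="https://example.com/order?code=CRTC@TrackLink">CRTC</a>'
    )
    print(strip_tracklink(sample))
```

In the sample, the first two links lose the suffix and the static ?code= link
keeps it.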

Live fix to the 10 broken registered links.url rows was applied separately
(first & -> ?); backup is in listmonk.pw_links_dkim_fix_bak_20260622.

Docs: new runbook incident section, plus a correction to the disproven
'use @TrackLink on all CTAs' guidance in fmcsa/hc plans.
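
The 'first & -> ?' repair applied to links.url can be expressed as a small
string transform. This is a sketch of the rule only, not the SQL that was run
live; the function name and the sample URL values are hypothetical.

```python
def repair_first_separator(url):
    """Turn the first '&' into '?' when the URL has no '?' at all."""
    if "?" in url or "&" not in url:
        return url  # already well-formed, or has no query part to repair
    path, rest = url.split("&", 1)
    return path + "?" + rest


if __name__ == "__main__":
    # Hypothetical value shaped like the broken rows described above.
    print(repair_first_separator("/order/slug&utm_source=x&utm_medium=y"))
```

URLs that already contain a '?' are returned unchanged, so the transform is
safe to re-run.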

Records the MAIN_EXCLUDE_OPERATORS=google override, the
resend_dkim_backup_20260622 rollback table, the HTTP 400 returned when send_at
is in the past (use --send-hour for same-day re-runs), and the exact revert SQL.
The backup holds 6461 rows; ~2999 were re-sent Jun 22 and the rest drain on
subsequent daily runs (Gmail excluded, Microsoft/Hotmail included).

Root cause of the Jun 2026 deliverability collapse / 'no new sales':
opendkim.conf was in single-key mode with no InternalHosts, so it signed only
mail arriving from 127.0.0.1. Transactional/cron mail (injected locally) was
signed, but ALL
campaign mail -- injected over the Docker bridge from the Listmonk containers
(172.18.0.5 trucking, 172.18.0.25 healthcare) -- went out UNSIGNED. Gmail/Yahoo
require DKIM on bulk mail since Feb 2024, so cold campaigns were junked/blocked
(~23% delivery, 550-5.7.1). Proof: 2,620 campaign msgs that day, 0 DKIM sigs.
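
How the proof count above was produced is not recorded here. One way to run
the same kind of check on captured raw messages is sketched below; the function
names and sample messages are hypothetical.

```python
from email import message_from_string


def has_dkim_signature(raw_message):
    """True if the raw RFC 5322 message carries a DKIM-Signature header."""
    return message_from_string(raw_message)["DKIM-Signature"] is not None


def count_unsigned(raw_messages):
    """Count messages that went out without any DKIM-Signature header."""
    return sum(1 for raw in raw_messages if not has_dkim_signature(raw))


# Hypothetical sample messages; example.com stands in for the real domain.
UNSIGNED = "From: a@example.com\nSubject: campaign\n\nbody\n"
SIGNED = (
    "DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel; b=abc\n"
    "From: a@example.com\nSubject: receipt\n\nbody\n"
)

if __name__ == "__main__":
    print(count_unsigned([UNSIGNED, SIGNED, UNSIGNED]))
```
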

The correct table files already existed on the server but were never wired into
opendkim.conf. The fix points the daemon at key.table/signing.table and sets
InternalHosts/ExternalIgnoreList to trusted.hosts (which includes 172.16.0.0/12,
the Docker subnet). Fixes BOTH streams: HC submission ports 2526-2528 inherit
the global smtpd_milters and *@performancewest.net covers compliance@.
Verified by injecting from a Docker IP through port 25 and port 2526 -- both now
get 'DKIM-Signature field added'. Codified as new Ansible role 'mail' so it
can't silently regress (OpenDKIM was previously not in IaC at all).
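
Why the InternalHosts change matters, as a sketch: it models the 'is this
client internal?' decision as a plain network-membership test and is not
OpenDKIM's own code. The BEFORE/AFTER lists are simplified assumptions; the
real trusted.hosts contents are not reproduced beyond the 172.16.0.0/12 entry
named above.

```python
import ipaddress


def is_internal(client_ip, internal_hosts):
    """True if client_ip falls inside any listed host or CIDR network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(entry) for entry in internal_hosts)


# Simplified assumption: with no InternalHosts set, only localhost is signed.
BEFORE = ["127.0.0.1"]
# Simplified assumption: after the fix, the Docker range is trusted as well.
AFTER = ["127.0.0.1", "172.16.0.0/12"]

# Listmonk container addresses named in this commit message.
CONTAINERS = {"trucking": "172.18.0.5", "healthcare": "172.18.0.25"}

if __name__ == "__main__":
    for name, ip in CONTAINERS.items():
        print(name, ip, is_internal(ip, BEFORE), is_internal(ip, AFTER))
```

Both container addresses fall outside the BEFORE list and inside the AFTER
list, which matches the unsigned-then-signed behaviour described above.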

Document the self-hosted MTA layout, the May 30-31 reputation collapse, the
Jun 02 remediation (retired burned IPs .91/.92/.93, swapped rotation to fresh
.94/.95/.96, full Yahoo-family hold map, Listmonk sliding-window cap, paused
the 13k-recipient blast scheduled for Jun 03), and the fresh-IP warmup rules +
monitoring commands.