Commit graph

8 commits

Author SHA1 Message Date
justin
4276adab80 infra(mail): fix warmed sending IPs dropping off ens18 on reboot (Jun 24 outage)
Unattended kernel-upgrade reboot (Jun 24 04:04) left only .71 bound because
classic ifupdown applies just the first 'address' line. Postfix then failed to
bind .94/.107 ('Cannot assign requested address') and silently egressed from
.71 -- which is NOT in SPF (every fallback msg failed SPF) and is on RLR621 +
Trend ERS-QIL. ~37h of bypassed IP-warming + a near-zero sales day.

Fixes:
- /etc/network/interfaces: explicit up/down ip-addr hooks for .72/.94/.107
- pw-mail-ips.service: systemd oneshot re-binds IPs + flushes queue on boot
- pw-mail-ip-watchdog: */5 cron re-binds missing IPs + flushes, also catches
  'Cannot assign' bind failures
- runbook: full incident writeup + reboot-test lesson

Host already remediated live; this commits the host artifacts + docs.
2026-06-25 17:28:33 -05:00
justin
3325259af7 fix(email): drop @TrackLink from per-subscriber CTAs (404 + collapse bug)
Listmonk @TrackLink registers ONE static URL per tracked link and points
every recipient's /link/<uuid> redirect at it. On per-subscriber hrefs
({{ lp_link }}, ?dot=, ?npi=, ?clia=) this is doubly broken:
 - the registered links.url was captured before the {{ lp_link }} token
   rendered, yielding /order/slug&utm_source=... (first &, no ?) -> 404
 - even when valid it collapses every carrier/provider onto the first
   subscriber's dot/npi/clia value

Real human clicks are already tracked via Umami campaign-click (bot
filtered), so Listmonk link tracking here is redundant and destructive.

Stripped @TrackLink from per-subscriber CTAs:
 - scripts/create_deficiency_source_campaigns.py (_cta, _dot_check_cta)
 - data/trucking_campaigns/{ucr,ifta}_*.html
 - data/hc_campaigns/*.html (10 templates)

Static CTAs (e.g. CRTC ?code= order link) keep @TrackLink (safe).
Live fix to the 10 broken registered links.url rows applied separately
(first & -> ?), backup in listmonk.pw_links_dkim_fix_bak_20260622.

Docs: new runbook incident section + corrected the disproven
'use @TrackLink on all CTAs' guidance in fmcsa/hc plans.
2026-06-22 17:01:39 -05:00
justin
62292b96af docs(deliverability): document Jun 22 re-send of never-delivered DKIM-window audience
Records the MAIN_EXCLUDE_OPERATORS=google override, the resend_dkim_backup_20260622
rollback table, the past-send_at HTTP 400 gotcha (use --send-hour for same-day
re-runs), and the exact revert SQL. 6461-row backup; ~2999 re-sent Jun 22, rest
drain on subsequent daily runs (Gmail excluded, Microsoft/Hotmail included).
2026-06-22 11:59:29 -05:00
justin
eba525f83f docs: runbook fix #8 — telecom/transactional HTML-only plaintext fix + campaign 407 finding 2026-06-17 21:17:06 -05:00
justin
4171f48736 docs: record post-incident email hardening (7 fixes) in runbook 2026-06-17 20:30:59 -05:00
justin
4d5901921e mail: fix OpenDKIM not signing campaign mail (Docker-injected) + codify in Ansible
Root cause of the Jun 2026 deliverability collapse / 'no new sales':
opendkim.conf was in single-key mode with no InternalHosts, so it signed only
127.0.0.1. Transactional/cron mail (injected locally) was signed, but ALL
campaign mail -- injected over the Docker bridge from the Listmonk containers
(172.18.0.5 trucking, 172.18.0.25 healthcare) -- went out UNSIGNED. Gmail/Yahoo
require DKIM on bulk mail since Feb 2024, so cold campaigns were junked/blocked
(~23% delivery, 550-5.7.1). Proof: 2,620 campaign msgs that day, 0 DKIM sigs.

The correct table files already existed on the server but were never wired into
opendkim.conf. Fix points the daemon at key.table/signing.table and sets
InternalHosts/ExternalIgnoreList to trusted.hosts (which includes 172.16.0.0/12,
the Docker subnet). Fixes BOTH streams: HC submission ports 2526-2528 inherit
the global smtpd_milters and *@performancewest.net covers compliance@.

Verified by injecting from a Docker IP through port 25 and port 2526 -- both now
get 'DKIM-Signature field added'. Codified as new Ansible role 'mail' so it
can't silently regress (OpenDKIM was previously not in IaC at all).
2026-06-17 19:31:19 -05:00
justin
8090fe0589 docs: ramp schedule + pw-listmonk-rampcap, fresh-IP day-0 send started 2026-06-02 12:32:42 -05:00
justin
98bcf0bbb0 docs: email deliverability + IP warmup runbook
Document the self-hosted MTA layout, the May 30-31 reputation collapse, the
Jun 02 remediation (retired burned IPs .91/.92/.93, swapped rotation to fresh
.94/.95/.96, full Yahoo-family hold map, Listmonk sliding-window cap, paused
the 13k-recipient blast scheduled for Jun 03), and the fresh-IP warmup rules +
monitoring commands.
2026-06-02 12:25:33 -05:00