Unattended kernel-upgrade reboot (Jun 24 04:04) left only .71 bound because
classic ifupdown applies just the first 'address' line. Postfix then failed to
bind .94/.107 ('Cannot assign requested address') and silently egressed from
.71 -- which is NOT in SPF (every fallback msg failed SPF) and is on RLR621 +
Trend ERS-QIL. ~37h of bypassed IP-warming + a near-zero sales day.
Fixes:
- /etc/network/interfaces: explicit up/down ip-addr hooks for .72/.94/.107
- pw-mail-ips.service: systemd oneshot re-binds IPs + flushes queue on boot
- pw-mail-ip-watchdog: */5 cron re-binds missing IPs + flushes, also catches
'Cannot assign' bind failures
- runbook: full incident writeup + reboot-test lesson
Host already remediated live; this commits the host artifacts + docs.
Email Deliverability & IP Warmup Runbook
Performance West self-hosts its outbound MTA (Postfix on the app server) because transactional relays (SES, Postmark, SendGrid) forbid the cold prospecting email our FMCSA trucking and telecom campaigns depend on. That means we own our sending-IP reputation and must manage it manually. This doc is the operational guide for keeping it healthy.
Infrastructure layout
- Host Postfix on the app server (207.174.124.71), reached by Listmonk via SMTP at 172.18.0.1:25.
- Sending IPs: 207.174.124.90 through .109 (20 IPs), each with valid FCrDNS (mtaNN.performancewest.net) and authorized in SPF (-all).
  - .90 / mta01: historically a dedicated Yahoo trickle IP. We no longer mail Yahoo at all, so it is idle.
  - .91-.109 / mta02-mta20: rotation pool, selected via transport_maps = hash:/etc/postfix/transport, randmap:{<active pool>}.
- Warmup scheduler: /usr/local/bin/pw-mta-warmup (daily cron /etc/cron.d/pw-mta-warmup, 07:17 UTC). Recomputes the active rotation pool from a start date stamped in /etc/postfix/pw-warmup-start. Ramp schedule: day 0-3 -> 3 IPs, 4-7 -> 5, 8-11 -> 8, 12-17 -> 12, 18-24 -> 16, 25+ -> 19. The pool only ever grows. It picks IPs from the front of the ALL=(...) array.
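The ramp schedule above can be sketched as a small lookup. This is an illustrative Python model only, not the contents of pw-mta-warmup; the names RAMP, pool_size and active_pool are hypothetical, and the IP list passed in stands in for the script's ALL=(...) array.

```python
# Ramp schedule from the runbook: (first warmup day, number of active IPs).
RAMP = [(0, 3), (4, 5), (8, 8), (12, 12), (18, 16), (25, 19)]


def pool_size(day: int) -> int:
    """Return how many rotation IPs are active on a given warmup day."""
    size = RAMP[0][1]
    for first_day, ips in RAMP:
        if day >= first_day:
            size = ips
    return size


def active_pool(all_ips: list[str], day: int) -> list[str]:
    """Take IPs from the front of the list, so the pool only ever grows."""
    return all_ips[: pool_size(max(day, 0))]


if __name__ == "__main__":
    # Hypothetical stand-in for the ALL=(...) array.
    all_ips = [f"207.174.124.{n}" for n in range(94, 110)]
    print(active_pool(all_ips, 5))
```

Because the pool is always a prefix of the same ordered list, a later day can only add IPs to the rotation and never swaps one out.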
What we do NOT mail
The Yahoo / Verizon-Media family is excluded entirely (yahoo, aol, att,
verizon, frontier, sbcglobal, bellsouth, pacbell, ameritech, ymail, rocketmail,
aim, netscape, compuserve, etc.). They aggressively defer cold senders with
421 4.7.0 [TSS04] ... unexpected volume or user complaints, and that deferral
poisons the sending IP for Gmail and Microsoft too.
Enforced in two layers:
- Audience build (authoritative): scripts/_email_exclusions.py (BLOCKED_EMAIL_DOMAINS), imported by build_trucking_campaigns.py and populate_new_carrier_startup_campaign.py. New campaigns never include them.
- Postfix backstop: /etc/postfix/transport maps every Yahoo-family domain to hold:. If any leak into the queue they are parked, never sent from a rotation IP.
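The audience-build layer amounts to a domain check before a recipient is queued. The sketch below is an assumption about its shape, not the real module: BLOCKED_EMAIL_DOMAINS is the name used in scripts/_email_exclusions.py, but the entries shown are an illustrative subset with assumed TLDs, and is_blocked and filter_audience are hypothetical helpers.

```python
# Illustrative subset only; the authoritative set lives in
# scripts/_email_exclusions.py. The TLDs here are assumptions.
BLOCKED_EMAIL_DOMAINS = {"yahoo.com", "aol.com", "att.net", "verizon.net", "ymail.com"}


def is_blocked(email: str) -> bool:
    """True if the address's domain is on the exclusion list (case-insensitive)."""
    domain = email.rpartition("@")[2].strip().lower()
    return domain in BLOCKED_EMAIL_DOMAINS


def filter_audience(emails):
    """Drop excluded recipients before a campaign list is built."""
    return [e for e in emails if not is_blocked(e)]


if __name__ == "__main__":
    print(filter_audience(["dispatch@example.com", "owner@Yahoo.com"]))
```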
Incident: May 30-31 2026 reputation collapse
A campaign blast pushed ~29k sends in a day across cold IPs .91/.92/.93 with no
daily volume cap. Result:
- Gmail: 550-5.7.1 ... likely unsolicited mail (hard spam block).
- Yahoo: 421 TSS04 on the rotation IPs.
- Steady state afterward: ~13% delivery (10k sent vs 68k deferred + 7k bounced in a day). Listmonk open rate ~4%, clicks ~0.
Remediation (Jun 02 2026)
- Retired the 3 burned IPs (.91/.92/.93 = out02/03/04) from rotation. Confirmed .94-.109 had never sent outbound (only inbound port-scan noise), so they are pristine.
- Swapped rotation to fresh .94/.95/.96 (out05/06/07) and reset the warmup start date to day 0.
- Patched pw-mta-warmup ALL array to start at out05 so the daily cron never reverts to the burned IPs.
- Rewrote /etc/postfix/transport to hold: the full Yahoo family (was a partial list with buggy duplicate keys routing to yahooslow).
- Flushed the entire stale queue (1,846 blast-era messages, mostly dead satellite ISPs) so fresh IPs start clean.
- Enabled Listmonk sliding-window rate limit so no campaign can blast again: app.message_sliding_window=true, duration 1h, rate 50, message_rate=2.
- Paused 19 trucking campaigns (IDs 275-293, ~13k recipients) that were scheduled to fire Jun 03; they were built before the exclusion fix and would have re-torched the fresh IPs. Rebuild them small/clean before resending.
Fresh-IP warmup discipline (the rules)
The historical mail.log proves these IPs sustain ~2,500 sends/day at 68-76%
delivery once warm (May 19-21). Collapses only ever came from 17k-29k spikes.
So we ramp ASSERTIVELY but never spike. The Listmonk sliding-window cap
(/usr/local/bin/pw-listmonk-rampcap, daily cron 07:20 UTC, driven off the same
/etc/postfix/pw-warmup-start stamp) enforces this automatically:
| warmup day | hourly cap | ~daily total |
|---|---|---|
| 0-1 | 50/h | ~500 |
| 2-3 | 150/h | ~1,500 |
| 4-6 | 250/h | ~2,500 |
| 7+ | 300/h | ~3,000 (hard ceiling) |
Hard rule from the data: never exceed ~4k/day, never spike.
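The cap table can be read as a step function of the warmup day. The following is a minimal sketch under the assumption that the stamp file holds a Unix epoch (as the warmup-day command in this runbook implies); it is not the pw-listmonk-rampcap script, and CAPS, warmup_day and hourly_cap are hypothetical names.

```python
# (first warmup day, sends per hour), taken from the cap table above.
CAPS = [(0, 50), (2, 150), (4, 250), (7, 300)]


def warmup_day(start_epoch: int, now_epoch: int) -> int:
    """Whole days elapsed since the warmup start stamp (never negative)."""
    return max(0, (now_epoch - start_epoch) // 86400)


def hourly_cap(day: int) -> int:
    """Return the sliding-window hourly cap for a warmup day."""
    cap = CAPS[0][1]
    for first_day, per_hour in CAPS:
        if day >= first_day:
            cap = per_hour
    return cap


if __name__ == "__main__":
    day = warmup_day(start_epoch=0, now_epoch=5 * 86400 + 120)
    print(day, hourly_cap(day))
```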
Other rules:
- Best recipients first. Gmail + Microsoft + clean ISPs only (Yahoo family already excluded). Send small focused batches, e.g. build_trucking_campaigns --only-segment mcs150 --max-per-segment 100 --date <today> --send-hour <H>.
- Scrub hard bounces immediately. 550 5.1.1, full mailbox, "not our customer" all hurt reputation signals.
- Watch the signals daily (see commands below). If Gmail 550-5.7.1 or Yahoo 421 TSS04 reappear, STOP and hold for several days.
Monitoring commands
# delivery mix today
sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep -oE 'status=(sent|deferred|bounced)' | sort | uniq -c
# per-IP outbound volume today (catch a runaway blast early)
for ip in 94 95 96; do echo -n ".$ip: "; sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep -c "207.174.124.$ip"; done
# top deferral / bounce reasons today
sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep status=deferred | grep -oE 'said: [0-9]{3}[^)]{0,50}' | sort | uniq -c | sort -rn | head
# queue size
sudo postqueue -p | tail -1
# active rotation pool + warmup day
sudo postconf -h transport_maps
echo $(( ($(date +%s) - $(sudo cat /etc/postfix/pw-warmup-start)) / 86400 ))
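For trend tracking it can help to turn the delivery-mix grep into a single rate. The sketch below is a hypothetical Python equivalent of the first command above, assuming Postfix log lines carry a status=sent|deferred|bounced field; the sample lines and the names delivery_mix and delivery_rate are illustrative.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"status=(sent|deferred|bounced)")


def delivery_mix(lines):
    """Count sent / deferred / bounced across an iterable of mail.log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def delivery_rate(counts):
    """Fraction of delivery attempts that ended in status=sent."""
    total = sum(counts.values())
    return counts["sent"] / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample lines in the usual Postfix shape.
    sample = [
        "Jun 24 04:10:01 host postfix/smtp[1]: A1: to=<a@example.com>, status=sent (250 ok)",
        "Jun 24 04:10:02 host postfix/smtp[2]: A2: to=<b@example.com>, status=deferred (451)",
    ]
    print(delivery_mix(sample), delivery_rate(delivery_mix(sample)))
```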
Backups left on the server (Jun 02 2026 remediation)
- /etc/postfix/main.cf.bak.*
- /etc/postfix/transport.bak.*
- /usr/local/bin/pw-mta-warmup.bak.*
Incident: Jun 17 2026 — campaign mail sent UNSIGNED (no DKIM)
Symptom: "no new sales." Campaigns were sending (~3-4k/day) but delivery was
~23% (sent 1,802 vs deferred 5,143 + bounced 580), Gmail returned 550-5.7.1 likely unsolicited mail, and there were zero clicks since Jun 8 despite
~600 opens/day.
Root cause: OpenDKIM was signing nothing that came from Listmonk.
/etc/opendkim.conf was in single-key mode with no InternalHosts, so it
defaulted to signing only 127.0.0.1. Cron/transactional mail is injected
locally (127.0.0.1) so it WAS signed — but campaign mail is injected over the
Docker bridge from the Listmonk containers (172.18.0.5 trucking,
172.18.0.25 healthcare). Those clients were not "internal," so OpenDKIM
verified (instead of signed) them: every cold email went out unsigned.
Since Feb 2024 Gmail/Yahoo require DKIM on bulk mail, so unsigned campaigns were
junked/blocked. Proof: 2,620 campaign messages that day, 0 "DKIM-Signature
field added" events, while the every-5-min cron mail was signed.
The correct table files already existed (/etc/opendkim/{key.table,signing.table,trusted.hosts}, and trusted.hosts already listed
172.16.0.0/12); they were simply never wired into opendkim.conf.
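The signing-scope question reduces to whether the injecting client address falls inside the trusted list. The sketch below models only that membership decision with Python's ipaddress module; it is not OpenDKIM's own matching code, is_internal is a hypothetical helper, and the two scope lists are simplified assumptions about the before and after states.

```python
import ipaddress


def is_internal(client_ip: str, trusted: list[str]) -> bool:
    """True if client_ip falls inside any trusted address or CIDR block."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net, strict=False) for net in trusted)


# Before the fix: no InternalHosts, so only loopback counts as internal.
default_scope = ["127.0.0.1"]
# After the fix: trusted.hosts is wired in and covers the Docker bridge range.
fixed_scope = ["127.0.0.1", "172.16.0.0/12"]

if __name__ == "__main__":
    for ip in ("127.0.0.1", "172.18.0.5", "172.18.0.25"):
        print(ip, is_internal(ip, default_scope), is_internal(ip, fixed_scope))
```

Both Listmonk container addresses (172.18.0.5 and 172.18.0.25) sit inside 172.16.0.0/12, which spans 172.16.0.0 through 172.31.255.255, so they are treated as internal only once that block is in scope.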
Fix (now codified in Ansible roles/mail): point opendkim.conf at the
tables and set the signing scope —
KeyTable refile:/etc/opendkim/key.table
SigningTable refile:/etc/opendkim/signing.table
InternalHosts /etc/opendkim/trusted.hosts # includes 172.16.0.0/12 (Docker)
ExternalIgnoreList /etc/opendkim/trusted.hosts
OversignHeaders From
then systemctl restart opendkim. This fixes BOTH streams at once: the
healthcare submission instances (ports 2526-2528) inherit the global
smtpd_milters and the *@performancewest.net signing table covers
compliance@. Verified by injecting a message from a Docker IP through both
port 25 and port 2526 and confirming "DKIM-Signature field added" for each.
Verify DKIM is actually signing campaign mail:
# Should be NON-ZERO and roughly track campaign volume:
sudo journalctl -u opendkim --since today | grep -c 'DKIM-Signature field added'
# Cross-check: campaign cleanup events today (should be similar order of magnitude)
sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -c postfix/cleanup
# Key still matches published DNS:
sudo opendkim-testkey -d performancewest.net -s mail -vvv # expect "key OK"
Still TODO from this incident (list quality + content, not yet done):
- Throttle/pause Gmail until reputation recovers (550-5.7.1 was still firing). The trucking ramp/cap (pw-listmonk-rampcap) currently holds at 200/h and the builder excludes the big-MX operators (Google/Microsoft/...) until warmup day 30; revisit once reputation recovers.
- Dead M365 tenant scrub: HC defers are mostly 451 4.4.4 against dead M365 tenants + 421 LuxSci throttle. Identify and suppress dead tenants.
Re-send of the never-delivered (unsigned) audience — Jun 22 2026
The ~79k cold sends made during the broken window (Jun 1 - Jun 18 00:31 UTC) were
stamped listmonk_sent_at at send time, so the builder permanently excluded them
even though they were junked/blocked unsigned. With DKIM now fixed we re-send to
the now-deliverable subset, excluding Gmail (Google consumer reputation is
still recovering) but including Microsoft/Hotmail (the bulk of the list).
What was done (all reversible):
- MAIN_EXCLUDE_OPERATORS env override added to the builder (commit 5a3063e): when set it REPLACES the default WARMUP_EXCLUDE_OPERATORS. Set to google in the workers service env so cold sends go to everything except Google, driving both the SQL exclude and the per-operator daily cap (google cap=0, others 120).
- Backed up the reset target to performancewest.resend_dkim_backup_20260622 (6,461 rows = broken-window AND email_verify_result IN (smtp_valid, send_confirmed) AND mx_provider <> google), then UPDATE fmcsa_carriers SET listmonk_sent_at = NULL for exactly those rows so the builder re-queues them.
- Ran the builder with --send-hour 17 --send-minute 30 (the default per-tz hours 09-12 UTC were already past; Listmonk rejects a past send_at with HTTP 400 "Scheduled date should be in the future", so always override the hour for a same-day manual re-run after the normal window). Result: 30 campaigns, queued_recipients=3000 (warmup cap), ~2,999 re-stamped. Provider mix: Microsoft 1,272 / Comcast / Charter / Proofpoint / long-tail; zero Google.
The remaining ~3.5k of the 6,461 backup set drain on subsequent daily runs under
the same cap. To revert the backed-up rows:
UPDATE fmcsa_carriers c SET listmonk_sent_at = b.old_listmonk_sent_at
FROM resend_dkim_backup_20260622 b WHERE c.dot_number = b.dot_number;
To resume normal warmup exclusion later, unset
MAIN_EXCLUDE_OPERATORS (reverts to Google+Microsoft+consumer-MX held to day 30).
Incident: Jun 22 2026 — @TrackLink on per-subscriber CTAs = 404 + collapse
Symptom. The trucking "deficiency" CTA buttons (the primary order link and the
secondary DOT-check link) rendered as Listmonk tracking redirects
(https://lists.performancewest.net/link/<uuid>/...) that 404'd. The redirect
target (registered in links.url) was https://performancewest.net/order/boc3-filing&utm_source=...
— note the & with no ? — an invalid URL.
Root cause. Listmonk's @TrackLink marker registers one static URL per
tracked link and points every recipient's /link/<uuid> redirect at that single
row. This is fundamentally incompatible with a per-subscriber href such as
{{ .Subscriber.Attribs.lp_link }}&utm_source=...:
- The registered links.url was captured with the {{ lp_link }} token dropped, yielding /order/slug&utm_source=... (first &, no ?) → 404 for everyone.
- Even if the URL had been valid, a static registration collapses every carrier onto the first subscriber's ?dot= (or ?npi= / ?clia=) value: the wrong order pre-fill for the entire blast.
By contrast, a static CTA (same URL for all recipients, e.g. the CRTC
?code=... order link) tracks correctly — keep @TrackLink there.
Why removing tracking loses nothing. Real human clicks are already attributed
via Umami's campaign-click event (bot-filtered by pw-bot-filter.js). Listmonk's
own click counters were already established as unreliable for this stream. So
Listmonk link tracking on per-subscriber CTAs is both redundant and destructive.
Fix — live (already-sent + in-flight mail). Rewrote the 10 broken registered
rows in place (replace the first & with ?) so the baked /link/<uuid> redirects
resolve. Backup table listmonk.pw_links_dkim_fix_bak_20260622 holds the old urls.
Verified the exact redirect that 404'd now returns 200 → lands on the (generic,
DOT-not-prefilled but fully functional) order page. To revert:
UPDATE links l SET url = b.url
FROM pw_links_dkim_fix_bak_20260622 b WHERE l.id = b.id; -- in the `listmonk` DB
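Why the registered URL 404'd, and what the in-place repair does, can be shown with the standard URL parser. This is a sketch of the logic only, not the SQL that was run; repair_first_ampersand is a hypothetical helper and the utm_source value is made up, since the original parameter list is elided.

```python
from urllib.parse import urlsplit


def repair_first_ampersand(url: str) -> str:
    """If the URL has no '?', turn its first '&' into '?'; otherwise leave it alone."""
    if "?" in url or "&" not in url:
        return url
    return url.replace("&", "?", 1)


# Hypothetical utm value; the real query string is elided in the incident notes.
broken = "https://performancewest.net/order/boc3-filing&utm_source=listmonk"
fixed = repair_first_ampersand(broken)

if __name__ == "__main__":
    # With no '?', the whole tail is parsed as part of the path, so the
    # server is asked for a route that does not exist.
    print(urlsplit(broken).path, repr(urlsplit(broken).query))
    print(urlsplit(fixed).path, repr(urlsplit(fixed).query))
```

The guard on an existing '?' keeps the helper from touching URLs that already have a valid query string.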
Fix — source (future builds, the real fix). Stripped @TrackLink from every
per-subscriber / per-provider CTA so each row renders its own direct link (no
redirect, no collapse). Files changed:
- scripts/create_deficiency_source_campaigns.py: _cta() (lp_link order button) and _dot_check_cta() (per-DOT tools link).
- data/trucking_campaigns/{ucr_annual_reminder,ifta_quarterly_reminder}.html (per-carrier lp_link).
- data/hc_campaigns/*.html (10 templates, per-provider ?npi= / ?clia=).

lp_link already starts its query with ?dot= (see lp_link_with_coupon()), so {{ lp_link }}&utm... renders to a valid per-carrier URL once the redirect is gone.
Healthcare note. The HC Listmonk DB (listmonk_hc) had 0 registered links
despite 13,425 sent — @TrackLink was not being stripped there at all, so the
literal @TrackLink shipped as harmless trailing text in utm_campaign and the
hrefs still 200'd (per-provider ?npi= was present literally in the template, not
via lp_link). No live HC breakage; source templates cleaned anyway to remove the
collapse risk on the next send.
Guardrail. Never put @TrackLink on an href containing a {{ .Subscriber... }}
token. Per-subscriber links must render directly; rely on Umami campaign-click
for human-click attribution.
Follow-up hardening — DONE (Jun 17-18 2026)
All discovered during the post-incident technical audit; each fix is codified.
1. OpenDKIM not signing: fixed + codified in Ansible roles/mail (commit 4d59019). Foundational fix above.
2. mail.log unbounded (~1 GB, no logrotate): this host logs via Postfix's built-in postlogd (no rsyslog), so a rename+create would strand the open inode. Added a copytruncate logrotate rule (daily, 14-day, compressed) to roles/mail (commit 2e4388a). Applied live, 1 GB archive compressed.
3. Plaintext (altbody) MIME part: all campaigns were HTML-only (a spam-score signal; Listmonk only emits multipart/alternative when altbody is set). New scripts/_email_plaintext.py renders a text/plain part from the HTML body (preserves Listmonk template tags, links -> "text (url)"); wired into the trucking builder (and thus UCR + IFTA) and the healthcare builder. Tests: scripts/test_email_plaintext.py. Commits a32a3b0, 4664601.
4. @localhost.localdomain Message-IDs: Listmonk derived the Message-ID from the random Docker container id. Pinned both listmonk + listmonk-hc hostname: perfwest.performancewest.net in docker-compose.yml (matches the SMTP hello_hostname). Commit a32a3b0.
5. Dead/legacy/satellite ISP scrub: added DEAD_ISP_DOMAINS (52 domains, identified from our own Listmonk bounce table) to BLOCKED_EMAIL_DOMAINS in _email_exclusions.py, so every builder that imports it stops cold-mailing them. Deliberately keeps still-active large consumer ISPs (comcast/charter/cox/centurylink); their bounces were the no-DKIM problem, not dead mailboxes. Commit c183957.
6. deploy@performancewest.net self-bounce: the deploy user's crontab held 3 jobs (payment_reminder, amb_location_scraper, renewal_worker) that are EXACT duplicates of systemd timers in the worker-crons role AND redirected to /var/log (which deploy cannot write), so they failed and cron mailed the error to deploy@ (no mailbox -> self-bounce). Removed the redundant deploy crontab (backed up to logs/deploy-crontab.bak.*); the systemd timers carry the work. No IaC change needed (Ansible never created that crontab).
7. Entire campaign pipeline was not in IaC: the campaign cron builders, IP warmup/ramp helpers, and bounce watchers lived ONLY on the host. New Ansible mail-pipeline role + playbooks/deploy-mail-pipeline.yml deploy them all from the canonical repo copies (infra/cron/, infra/postfix/, infra/monitoring/, infra/systemd/, scripts/*bounce*). Commit 4dc5690.
8. Telecom + transactional email was also HTML-only: the campaign-builder plaintext fix (#3) only covered Listmonk mass-mail. The telecom/filing/customer-transactional path (499-Q reminders, RMD/FCC filing review links, intake/completion/delivery/commission emails, order confirmations) builds its own MIMEMultipart / nodemailer messages, and ~17 of them attached ONLY an HTML part: a malformed single-part multipart/alternative and a spam signal. Fixed at the source so all callers are covered:
   - scripts/workers/worker_email.py send_worker_email() now auto-derives the text/plain part from HTML via _email_plaintext.html_to_text when the caller omits text=.
   - 16 rolled-their-own Python senders (scripts/workers/**, scripts/formation/document_delivery.py) attach an html_to_text(...) plaintext sibling before the HTML part (job_server + document_delivery wrap text+html in an alternative sub-part so PDF/DOCX still attach to the mixed root).
   - api/src/email.ts gained a dependency-free htmlToText() and sendEmail now defaults text to it (covers checkout/webhook HTML-only sends).

   NB: telecom campaigns themselves are still manually created+sent in the Listmonk UI (no send automation; compliance_alert_list.py / rmd_deficiency_campaign.py only populate lists). The one telecom send to date (campaign 407 "FCC Deficiency Report - FREEDOM249", Jun 08) was HTML-only AND sent inside the DKIM-broken window: 384 sent / 343 views / 0 clicks (the same junked-mail signature as the trucking blasts). Any future telecom UI campaign should set an altbody (Listmonk "Plain text" toggle) and run through the same dead-ISP/suppression hygiene. Commit b375385.
INCIDENT 2026-06-24: warmed sending IPs dropped off the interface after reboot
Impact: ~37h of degraded deliverability + a near-zero sales day (Jun 24 04:04 -> Jun 25 17:25). Root cause was infrastructure, not reputation.
What happened. An unattended kernel upgrade rebooted the host at Jun 24 04:04
(6.12.90 -> 6.12.94). The warmed sending IPs .94 (trucking/out05) and .107
(HC/hcout1) are defined in /etc/network/interfaces, but classic ifupdown
(0.8.44) only applies the FIRST address line per stanza -- so only .71
(the primary) came back up. Postfix's smtp_bind_address=.94/.107 then failed
with warning: smtp_connect_addr: bind ...: Cannot assign requested address and
silently fell back to egressing from .71. .71 is (a) NOT in the SPF
record (v=spf1 ... ip4:.94 ip4:.107 -all) so every fallback message failed
SPF, and (b) listed on RLR621 + Trend Micro ERS-QIL, so receivers
deferred them (451 ... blacklisted - RLR621 - ip=<207.174.124.71>). Net: the
IP warming was bypassed and mail either failed SPF or got reputation-deferred.
Detection. Tail of /var/log/mail.log showed Cannot assign requested address (16,993 in one log) + deferrals citing ip=<207.174.124.71>.
ip -4 addr show ens18 showed only .71 bound (missing .72/.94/.107).
last reboot pinned the start to the 04:04 boot. Major RBLs (Spamhaus ZEN/DBL,
Barracuda, SpamCop, SORBS) were still clean for .94/.107 and the domain --
RLR621/ERS-QIL are proprietary soft listings keyed off .71/HELO and age off.
Fix (all applied 2026-06-25 ~17:25 CDT).
- Re-bound live: ip addr add 207.174.124.{72,94,107}/23 dev ens18, then postqueue -f.
- Reboot-persistence in /etc/network/interfaces: added explicit up/down ip addr add/del ... hooks for the 3 secondaries (classic ifupdown ignores 2nd+ address lines; the hooks are honored). Backup at /etc/network/interfaces.bak-*.
- Belt-and-suspenders systemd oneshot pw-mail-ips.service (in repo at infra/mail/pw-mail-ips.service) re-binds the IPs + flushes the queue on boot.
- Watchdog cron */5 pw-mail-ip-watchdog (repo infra/mail/) re-binds any missing sending IP and flushes if it had to act or sees Cannot assign lines.
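The detection half of the watchdog is a comparison between the expected addresses and what the interface actually holds. The sketch below is a hypothetical Python model of that check, not the pw-mail-ip-watchdog script in infra/mail/; it only parses text in the usual `ip -4 addr show` format and does not re-bind anything. The EXPECTED list is assembled from the addresses named in this incident.

```python
import re

# Addresses this incident says should be bound on ens18.
EXPECTED = ["207.174.124.71", "207.174.124.72", "207.174.124.94", "207.174.124.107"]


def bound_ipv4(ip_addr_output: str) -> set[str]:
    """Extract IPv4 addresses from `ip -4 addr show <dev>` style output."""
    return set(re.findall(r"\binet (\d+\.\d+\.\d+\.\d+)/\d+", ip_addr_output))


def missing_ips(ip_addr_output: str, expected=EXPECTED) -> list[str]:
    """Return expected addresses that are not currently bound, in order."""
    bound = bound_ipv4(ip_addr_output)
    return [ip for ip in expected if ip not in bound]


if __name__ == "__main__":
    # Illustrative post-reboot state: only the primary address came back.
    after_reboot = (
        "2: ens18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP\n"
        "    inet 207.174.124.71/23 brd 207.174.125.255 scope global ens18\n"
        "       valid_lft forever preferred_lft forever\n"
    )
    print(missing_ips(after_reboot))
```

A non-empty result is the condition under which the watchdog would need to act; treating it as an alert as well as a repair trigger would have surfaced the Jun 24 reboot within minutes.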
Lesson / TODO. The host does unattended-upgrade reboots ~weekly (seen
05-25, 05-30, 06-24, all ~04:04). Any IP/transport change must be reboot-tested.
Consider migrating ifupdown -> netplan with all addresses, or pin
unattended-upgrades to skip auto-reboot. The mail_reputation_monitor.py
attributes egress to .71 as "transactional default" -- after this incident, a
spike of .71 egress in the bulk streams is itself an alarm.