Listmonk @TrackLink registers ONE static URL per tracked link and points
every recipient's /link/<uuid> redirect at it. On per-subscriber hrefs
({{ lp_link }}, ?dot=, ?npi=, ?clia=) this is doubly broken:
- the registered links.url was captured before the {{ lp_link }} token
rendered, yielding /order/slug&utm_source=... (first &, no ?) -> 404
- even when valid it collapses every carrier/provider onto the first
subscriber's dot/npi/clia value
Real human clicks are already tracked via Umami campaign-click (bot
filtered), so Listmonk link tracking here is redundant and destructive.
Stripped @TrackLink from per-subscriber CTAs:
- scripts/create_deficiency_source_campaigns.py (_cta, _dot_check_cta)
- data/trucking_campaigns/{ucr,ifta}_*.html
- data/hc_campaigns/*.html (10 templates)
Static CTAs (e.g. CRTC ?code= order link) keep @TrackLink (safe).
Live fix to the 10 broken registered links.url rows applied separately
(first & -> ?), backup in listmonk.pw_links_dkim_fix_bak_20260622.
Docs: new runbook incident section + corrected the disproven
'use @TrackLink on all CTAs' guidance in fmcsa/hc plans.
# Email Deliverability & IP Warmup Runbook

Performance West self-hosts its outbound MTA (Postfix on the app server) because
transactional relays (SES, Postmark, SendGrid) forbid the cold prospecting email
our FMCSA trucking and telecom campaigns depend on. That means **we own our
sending-IP reputation** and must manage it manually. This doc is the operational
guide for keeping it healthy.

## Infrastructure layout

- **Host Postfix** on the app server (`207.174.124.71`), reached by Listmonk via
  SMTP at `172.18.0.1:25`.
- **Sending IPs:** `207.174.124.90` through `.109` (20 IPs), each with valid
  FCrDNS (`mtaNN.performancewest.net`) and authorized in SPF (`-all`).
  - `.90` / `mta01`: historically a dedicated Yahoo trickle IP. We no longer mail
    Yahoo at all, so it is idle.
  - `.91-.109` / `mta02-mta20`: rotation pool, selected via
    `transport_maps = hash:/etc/postfix/transport, randmap:{<active pool>}`.
- **Warmup scheduler:** `/usr/local/bin/pw-mta-warmup` (daily cron
  `/etc/cron.d/pw-mta-warmup`, 07:17 UTC). Recomputes the active rotation pool
  from a start date stamped in `/etc/postfix/pw-warmup-start`. Ramp schedule:
  day 0-3 -> 3 IPs, 4-7 -> 5, 8-11 -> 8, 12-17 -> 12, 18-24 -> 16, 25+ -> 19.
  The pool only ever grows. It picks IPs from the front of the `ALL=(...)` array.

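
The ramp schedule above can be read as a simple step function. The following is a minimal Python sketch of that lookup, not the actual `pw-mta-warmup` script; the names `RAMP`, `pool_size`, and `active_pool` are hypothetical, and the `outNN` labels in the usage note are only illustrative.

```python
# Ramp tiers copied from the schedule above: (first warmup day of tier, pool size).
RAMP = [(0, 3), (4, 5), (8, 8), (12, 12), (18, 16), (25, 19)]


def pool_size(warmup_day: int) -> int:
    """Return how many IPs should be in rotation on a given warmup day."""
    if warmup_day < 0:
        raise ValueError("warmup day cannot be negative")
    size = RAMP[0][1]
    for first_day, ips in RAMP:
        if warmup_day >= first_day:
            size = ips  # keep the last tier whose first day has been reached
    return size


def active_pool(all_ips: list[str], warmup_day: int) -> list[str]:
    """Take IPs from the front of the list, mirroring the ALL=(...) array."""
    return all_ips[:pool_size(warmup_day)]
```

For example, `active_pool(["out05", "out06", "out07", "out08"], 0)` keeps only the first three entries. Because the tiers never shrink, a pool built this way only ever grows as the day count increases.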
## What we do NOT mail

The **Yahoo / Verizon-Media family** is excluded entirely (yahoo, aol, att,
verizon, frontier, sbcglobal, bellsouth, pacbell, ameritech, ymail, rocketmail,
aim, netscape, compuserve, etc.). They aggressively defer cold senders with
`421 4.7.0 [TSS04] ... unexpected volume or user complaints`, and that deferral
poisons the sending IP for Gmail and Microsoft too.

Enforced in two layers:
1. **Audience build** (authoritative): `scripts/_email_exclusions.py`
   (`BLOCKED_EMAIL_DOMAINS`), imported by `build_trucking_campaigns.py` and
   `populate_new_carrier_startup_campaign.py`. New campaigns never include them.
2. **Postfix backstop:** `/etc/postfix/transport` maps every Yahoo-family domain
   to `hold:`. If any leak into the queue they are parked, never sent from a
   rotation IP.

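
As an illustration of the audience-build layer, here is a minimal sketch of a domain filter. It is not the contents of `scripts/_email_exclusions.py`: the set below is a tiny hypothetical subset with assumed TLDs, and `is_blocked` / `filter_audience` are made-up helper names.

```python
# Hypothetical subset for illustration only; the authoritative list is
# BLOCKED_EMAIL_DOMAINS in scripts/_email_exclusions.py.
BLOCKED_EMAIL_DOMAINS = {"yahoo.com", "aol.com", "att.net", "verizon.net", "sbcglobal.net"}


def is_blocked(email: str, blocked=BLOCKED_EMAIL_DOMAINS) -> bool:
    """True when the address's domain (case-insensitive) is on the blocklist."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return domain in blocked


def filter_audience(emails):
    """Drop blocked addresses before they ever reach a campaign list."""
    return [e for e in emails if not is_blocked(e)]
```

Filtering at build time keeps blocked recipients out of Listmonk entirely, so the Postfix `hold:` mapping only has to catch leaks.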
## Incident: May 30-31 2026 reputation collapse

A campaign blast pushed ~29k sends in a day across cold IPs `.91/.92/.93` with no
daily volume cap. Result:
- Gmail: `550-5.7.1 ... likely unsolicited mail` (hard spam block).
- Yahoo: `421 TSS04` on the rotation IPs.
- Steady state afterward: ~13% delivery (10k sent vs 68k deferred + 7k bounced
  in a day). Listmonk open rate ~4%, clicks ~0.

### Remediation (Jun 02 2026)
- **Retired the 3 burned IPs** (`.91/.92/.93` = out02/03/04) from rotation.
  Confirmed `.94-.109` had never sent outbound (only inbound port-scan noise),
  so they are pristine.
- **Swapped rotation to fresh `.94/.95/.96`** (out05/06/07) and reset the warmup
  start date to day 0.
- **Patched `pw-mta-warmup`** `ALL` array to start at `out05` so the daily cron
  never reverts to the burned IPs.
- **Rewrote `/etc/postfix/transport`** to `hold:` the full Yahoo family (was a
  partial list with buggy duplicate keys routing to `yahooslow`).
- **Flushed the entire stale queue** (1,846 blast-era messages, mostly dead
  satellite ISPs) so fresh IPs start clean.
- **Enabled Listmonk sliding-window rate limit** so no campaign can blast again:
  `app.message_sliding_window=true`, duration `1h`, rate `50`, `message_rate=2`.
- **Paused 19 trucking campaigns** (IDs 275-293, ~13k recipients) that were
  scheduled to fire Jun 03; they were built before the exclusion fix and would
  have re-torched the fresh IPs. Rebuild them small/clean before resending.

## Fresh-IP warmup discipline (the rules)

The historical mail.log proves these IPs sustain ~2,500 sends/day at 68-76%
delivery once warm (May 19-21). Collapses only ever came from 17k-29k spikes.
So we ramp ASSERTIVELY but never spike. The Listmonk sliding-window cap
(`/usr/local/bin/pw-listmonk-rampcap`, daily cron 07:20 UTC, driven off the same
`/etc/postfix/pw-warmup-start` stamp) enforces this automatically:

| warmup day | hourly cap | ~daily total |
|-----------:|-----------:|-------------:|
| 0-1 | 50/h | ~500 |
| 2-3 | 150/h | ~1,500 |
| 4-6 | 250/h | ~2,500 |
| 7+ | 300/h | ~3,000 (hard ceiling) |

Hard rule from the data: **never exceed ~4k/day, never spike.**

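
The cap table can be expressed as a lookup keyed on the warmup day. This is a sketch under stated assumptions, not the real `pw-listmonk-rampcap`: it assumes the start stamp is a Unix epoch in seconds (as the arithmetic in the monitoring commands implies), and `SEND_HOURS_PER_DAY = 10` is inferred from the table's ratios (50/h giving ~500/day), not a documented setting. All function names are hypothetical.

```python
SECONDS_PER_DAY = 86400

# (first warmup day of tier, hourly cap), copied from the table above.
CAPS = [(0, 50), (2, 150), (4, 250), (7, 300)]

# Assumption: inferred from the table (hourly cap x 10 = daily total).
SEND_HOURS_PER_DAY = 10


def warmup_day(now_epoch: int, start_epoch: int) -> int:
    """Whole days elapsed since the warmup start stamp (never negative)."""
    return max(0, (now_epoch - start_epoch) // SECONDS_PER_DAY)


def hourly_cap(day: int) -> int:
    """Hourly send cap for a warmup day, using the last tier reached."""
    cap = CAPS[0][1]
    for first_day, per_hour in CAPS:
        if day >= first_day:
            cap = per_hour
    return cap


def approx_daily_total(day: int) -> int:
    """Rough daily volume implied by the hourly cap."""
    return hourly_cap(day) * SEND_HOURS_PER_DAY
```

Under these assumptions the top tier yields roughly 3,000 sends per day, which stays below the ~4k/day hard rule.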
Other rules:
1. **Best recipients first.** Gmail + Microsoft + clean ISPs only (Yahoo family
   already excluded). Send small focused batches, e.g.
   `build_trucking_campaigns --only-segment mcs150 --max-per-segment 100 --date <today> --send-hour <H>`.
2. **Scrub hard bounces immediately.** `550 5.1.1`, full mailbox, "not our
   customer" all hurt reputation signals.
3. **Watch the signals daily** (see commands below). If Gmail `550-5.7.1` or
   Yahoo `421 TSS04` reappear, STOP and hold for several days.

## Monitoring commands

```bash
# delivery mix today
sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep -oE 'status=(sent|deferred|bounced)' | sort | uniq -c

# per-IP outbound volume today (catch a runaway blast early)
for ip in 94 95 96; do echo -n ".$ip: "; sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep -c "207.174.124.$ip"; done

# top deferral / bounce reasons today
sudo grep "^$(date '+%b %d')" /var/log/mail.log | grep status=deferred | grep -oE 'said: [0-9]{3}[^)]{0,50}' | sort | uniq -c | sort -rn | head

# queue size
sudo postqueue -p | tail -1

# active rotation pool + warmup day
sudo postconf -h transport_maps
echo $(( ($(date +%s) - $(sudo cat /etc/postfix/pw-warmup-start)) / 86400 ))
```

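
The delivery-mix command yields raw counts; turning them into a single rate makes day-to-day comparison easier. Below is a minimal sketch that does the same counting in Python and derives the share of attempts that ended in `status=sent`. The helper names are hypothetical, and the rate is per delivery attempt (a deferred message is logged again on each retry), so it is an approximation rather than a per-message figure.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"status=(sent|deferred|bounced)")


def delivery_mix(lines):
    """Count Postfix delivery outcomes in an iterable of mail.log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def delivery_rate(counts):
    """Share of logged delivery attempts that ended in status=sent."""
    total = counts["sent"] + counts["deferred"] + counts["bounced"]
    return counts["sent"] / total if total else 0.0
```

It could be fed with the lines of `/var/log/mail.log` for the day of interest, for instance the output of the date-anchored `grep` used above.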
## Backups left on the server (Jun 02 2026 remediation)
- `/etc/postfix/main.cf.bak.*`
- `/etc/postfix/transport.bak.*`
- `/usr/local/bin/pw-mta-warmup.bak.*`

## Incident: Jun 17 2026, campaign mail sent UNSIGNED (no DKIM)

**Symptom:** "no new sales." Campaigns were sending (~3-4k/day) but delivery was
~23% (sent 1,802 vs deferred 5,143 + bounced 580), Gmail returned
`550-5.7.1 likely unsolicited mail`, and there were **zero clicks since Jun 8**
despite ~600 opens/day.

**Root cause:** OpenDKIM was signing **nothing** that came from Listmonk.
`/etc/opendkim.conf` was in single-key mode with **no `InternalHosts`**, so it
defaulted to signing only `127.0.0.1`. Cron/transactional mail is injected
locally (127.0.0.1) so it WAS signed, but campaign mail is injected over the
Docker bridge from the Listmonk containers (`172.18.0.5` trucking,
`172.18.0.25` healthcare). Those clients were not "internal," so OpenDKIM
*verified* (instead of *signed*) them: every cold email went out **unsigned**.
Since Feb 2024 Gmail/Yahoo require DKIM on bulk mail, so unsigned campaigns were
junked/blocked. Proof: `2,620` campaign messages that day, `0` "DKIM-Signature
field added" events, while the every-5-min cron mail was signed.

The correct table files already existed
(`/etc/opendkim/{key.table,signing.table,trusted.hosts}`, and `trusted.hosts`
already listed `172.16.0.0/12`); they were simply **never wired into
`opendkim.conf`**.

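
The sign-versus-verify decision comes down to whether the injecting client's IP falls inside the internal-hosts list. The sketch below models only that membership test with Python's `ipaddress` module; it is not OpenDKIM's implementation. `DEFAULT_INTERNAL` reflects the default described above, and `TRUSTED_HOSTS` is an assumed reading of `trusted.hosts` (only the `172.16.0.0/12` entry is confirmed by this runbook).

```python
import ipaddress

# Default scope described above: only the loopback address is "internal".
DEFAULT_INTERNAL = ["127.0.0.1"]

# Assumed contents for illustration; the runbook only confirms 172.16.0.0/12.
TRUSTED_HOSTS = ["127.0.0.1", "172.16.0.0/12"]


def is_internal(client_ip: str, internal_hosts) -> bool:
    """True when client_ip matches any address or CIDR block in the list."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(entry, strict=False) for entry in internal_hosts
    )


for client in ("127.0.0.1", "172.18.0.5", "172.18.0.25"):
    before = is_internal(client, DEFAULT_INTERNAL)
    after = is_internal(client, TRUSTED_HOSTS)
    print(f"{client}: internal by default={before}, with trusted.hosts={after}")
```

Both Listmonk container addresses sit inside `172.16.0.0/12`, which is why pointing `InternalHosts` at the existing file is enough to bring campaign mail into signing scope.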
**Fix (now codified in Ansible `roles/mail`):** point `opendkim.conf` at the
tables and set the signing scope:
```
KeyTable refile:/etc/opendkim/key.table
SigningTable refile:/etc/opendkim/signing.table
InternalHosts /etc/opendkim/trusted.hosts # includes 172.16.0.0/12 (Docker)
ExternalIgnoreList /etc/opendkim/trusted.hosts
OversignHeaders From
```
then `systemctl restart opendkim`. This fixes BOTH streams at once: the
healthcare submission instances (ports 2526-2528) inherit the global
`smtpd_milters` and the `*@performancewest.net` signing table covers
`compliance@`. Verified by injecting a message from a Docker IP through both
port 25 and port 2526 and confirming "DKIM-Signature field added" for each.

**Verify DKIM is actually signing campaign mail:**
```bash
# Should be NON-ZERO and roughly track campaign volume:
sudo journalctl -u opendkim --since today | grep -c 'DKIM-Signature field added'
# Cross-check: campaign cleanup events today (should be similar order of magnitude)
sudo grep "^$(date '+%b %e')" /var/log/mail.log | grep -c postfix/cleanup
# Key still matches published DNS:
sudo opendkim-testkey -d performancewest.net -s mail -vvv # expect "key OK"
```

**Still TODO from this incident (list quality + content, not yet done):**
- Throttle/pause Gmail until reputation recovers (`550-5.7.1` was still firing).
  The trucking ramp/cap (`pw-listmonk-rampcap`) currently holds at 200/h and the
  builder excludes the big-MX operators (Google/Microsoft/...) until warmup
  day 30; revisit once reputation recovers.
- Dead M365 tenant scrub: HC defers are mostly `451 4.4.4` against dead M365
  tenants + `421` LuxSci throttle. Identify and suppress dead tenants.

### Re-send of the never-delivered (unsigned) audience (Jun 22 2026)

The ~79k cold sends made during the broken window (Jun 1 - Jun 18 00:31 UTC) were
stamped `listmonk_sent_at` at send time, so the builder permanently excluded them
even though they were junked/blocked unsigned. With DKIM now fixed we re-send to
the now-deliverable subset, **excluding Gmail** (Google consumer reputation is
still recovering) but **including Microsoft/Hotmail** (the bulk of the list).

What was done (all reversible):
1. `MAIN_EXCLUDE_OPERATORS` env override added to the builder (commit `5a3063e`):
   when set it REPLACES the default `WARMUP_EXCLUDE_OPERATORS`. Set to `google` in
   the `workers` service env so cold sends go to everything except Google, driving
   both the SQL exclude and the per-operator daily cap (google cap=0, others 120).
2. Backed up the reset target to `performancewest.resend_dkim_backup_20260622`
   (6,461 rows = broken-window AND `email_verify_result IN (smtp_valid,
   send_confirmed)` AND `mx_provider <> google`), then `UPDATE fmcsa_carriers SET
   listmonk_sent_at = NULL` for exactly those rows so the builder re-queues them.
3. Ran the builder with `--send-hour 17 --send-minute 30` (the default per-tz hours
   09-12 UTC were already past; **Listmonk rejects a past `send_at` with HTTP 400
   "Scheduled date should be in the future"**, so always override the hour for a
   same-day manual re-run after the normal window). Result: 30 campaigns,
   queued_recipients=3000 (warmup cap), ~2,999 re-stamped. Provider mix: Microsoft
   1,272 / Comcast / Charter / Proofpoint / long-tail; **zero Google**.

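
The past-`send_at` rejection in step 3 is easy to catch before calling Listmonk at all. The following is a minimal sketch of such a pre-flight check, assuming UTC timestamps; `same_day_send_at` is a hypothetical helper, not part of the builder.

```python
from datetime import datetime, timezone


def same_day_send_at(now: datetime, hour: int, minute: int = 0) -> datetime:
    """Build today's send time and refuse it if it is not in the future.

    Mirrors the Listmonk rule quoted above: a scheduled date in the past is
    rejected, so a same-day manual re-run needs a later hour.
    """
    send_at = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if send_at <= now:
        raise ValueError(
            f"send_at {send_at.isoformat()} is not in the future; "
            "override the send hour for a same-day re-run"
        )
    return send_at


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    try:
        print(same_day_send_at(now, 17, 30).isoformat())
    except ValueError as exc:
        print(f"would be rejected: {exc}")
```

Failing early in the builder gives a clearer error than a batch of HTTP 400 responses partway through campaign creation.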
The remaining ~3.5k of the 6,461 backup set drain on subsequent daily runs under
the same cap. To revert the reset for every backed-up row:
`UPDATE fmcsa_carriers c SET listmonk_sent_at = b.old_listmonk_sent_at FROM resend_dkim_backup_20260622 b WHERE c.dot_number = b.dot_number;`.
To resume normal warmup exclusion later, unset `MAIN_EXCLUDE_OPERATORS`
(reverts to Google+Microsoft+consumer-MX held to day 30).

### Incident: Jun 22 2026, `@TrackLink` on per-subscriber CTAs = 404 + collapse

**Symptom.** The trucking "deficiency" CTA buttons (the primary order link and the
secondary DOT-check link) rendered as Listmonk tracking redirects
(`https://lists.performancewest.net/link/<uuid>/...`) that **404'd**. The redirect
target (registered in `links.url`) was
`https://performancewest.net/order/boc3-filing&utm_source=...`
(note the `&` with **no `?`**), an invalid URL.

**Root cause.** Listmonk's `@TrackLink` marker registers **one static URL per
tracked link** and points every recipient's `/link/<uuid>` redirect at that single
row. This is fundamentally incompatible with a **per-subscriber** href such as
`{{ .Subscriber.Attribs.lp_link }}&utm_source=...`:
- The registered `links.url` was captured with the `{{ lp_link }}` token dropped,
  yielding `/order/slug&utm_source=...` (first `&`, no `?`) → **404 for everyone**.
- Even if the URL had been valid, a static registration **collapses every carrier
  onto the first subscriber's** `?dot=` (or `?npi=`/`?clia=`) value: the wrong
  order pre-fill for the entire blast.

By contrast, a **static** CTA (same URL for all recipients, e.g. the CRTC
`?code=...` order link) tracks correctly, so keep `@TrackLink` there.

**Why removing tracking loses nothing.** Real human clicks are already attributed
via Umami's `campaign-click` event (bot-filtered by `pw-bot-filter.js`). Listmonk's
own click counters were already established as unreliable for this stream. So
Listmonk link tracking on per-subscriber CTAs is both redundant and destructive.

**Fix, live (already-sent + in-flight mail).** Rewrote the 10 broken registered
rows in place (replace the first `&` with `?`) so the baked `/link/<uuid>` redirects
resolve. Backup table `listmonk.pw_links_dkim_fix_bak_20260622` holds the old urls.
Verified the exact redirect that 404'd now returns 200 → lands on the (generic,
DOT-not-prefilled but fully functional) order page. To revert:
```sql
UPDATE links l SET url = b.url
FROM pw_links_dkim_fix_bak_20260622 b WHERE l.id = b.id; -- in the `listmonk` DB
```

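
The in-place repair (first `&` becomes `?`) can be sketched as a small pure function. This is an illustration of the transformation, not the statement that was run against the `links` table; `repair_first_ampersand` is a hypothetical name and the `utm_*` values in the usage note are placeholders.

```python
def repair_first_ampersand(url: str) -> str:
    """Turn the first '&' into '?' when the URL has query params but no '?'.

    URLs that already contain '?' (or have no '&' at all) are returned
    unchanged, so applying the repair twice is harmless.
    """
    if "?" in url or "&" not in url:
        return url
    return url.replace("&", "?", 1)
```

For example, `repair_first_ampersand("https://performancewest.net/order/boc3-filing&utm_source=x&utm_medium=y")` returns the same URL with `?utm_source=x&utm_medium=y` as its query string. The early return makes the function safe to run over every row, including ones that were never broken.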
**Fix, source (future builds, the real fix).** Stripped `@TrackLink` from every
**per-subscriber / per-provider** CTA so each row renders its own direct link (no
redirect, no collapse). Files changed:
- `scripts/create_deficiency_source_campaigns.py`: `_cta()` (lp_link order button)
  and `_dot_check_cta()` (per-DOT tools link).
- `data/trucking_campaigns/{ucr_annual_reminder,ifta_quarterly_reminder}.html`
  (per-carrier `lp_link`).
- `data/hc_campaigns/*.html` (10 templates, per-provider `?npi=`/`?clia=`).

`lp_link` already starts its query with `?dot=` (see `lp_link_with_coupon()`), so
`{{ lp_link }}&utm...` renders to a valid per-carrier URL once the redirect is gone.

**Healthcare note.** The HC Listmonk DB (`listmonk_hc`) had **0 registered links**
despite 13,425 sent: `@TrackLink` was not being stripped there at all, so the
literal `@TrackLink` shipped as harmless trailing text in `utm_campaign` and the
hrefs still 200'd (per-provider `?npi=` was present literally in the template, not
via lp_link). No live HC breakage; source templates cleaned anyway to remove the
collapse risk on the next send.

**Guardrail.** Never put `@TrackLink` on an href containing a `{{ .Subscriber... }}`
token. Per-subscriber links must render directly; rely on Umami `campaign-click`
for human-click attribution.

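
The guardrail lends itself to an automated template check. Below is a minimal sketch of a lint that flags any `href` carrying both a template token and the `@TrackLink` marker. It is not an existing script in the repo: `unsafe_tracked_hrefs` is a hypothetical name, the check looks for any `{{` token (slightly broader than `.Subscriber` alone), and the regex assumes quoted `href` attributes whose values contain no nested quotes of the same kind.

```python
import re

# Matches href="..." or href='...' and captures the attribute value.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)


def unsafe_tracked_hrefs(html: str) -> list[str]:
    """Return hrefs that combine a template token with @TrackLink."""
    return [
        match.group(2)
        for match in HREF_RE.finditer(html)
        if "@TrackLink" in match.group(2) and "{{" in match.group(2)
    ]


if __name__ == "__main__":
    import sys

    failed = False
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for href in unsafe_tracked_hrefs(handle.read()):
                failed = True
                print(f"{path}: per-subscriber href uses @TrackLink: {href}")
    sys.exit(1 if failed else 0)
```

Run over the campaign templates before a build, a non-zero exit would stop a per-subscriber CTA from being registered as a single static tracked link. Static CTAs pass untouched because their hrefs contain no template token.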
### Follow-up hardening: DONE (Jun 17-18 2026)

All discovered during the post-incident technical audit; each fix is codified.

1. **OpenDKIM not signing:** fixed + codified in Ansible `roles/mail`
   (commit `4d59019`). Foundational fix above.
2. **`mail.log` unbounded (~1 GB, no logrotate):** this host logs via Postfix's
   built-in `postlogd` (no rsyslog), so a rename+create would strand the open
   inode. Added a `copytruncate` logrotate rule (daily, 14-day, compressed) to
   `roles/mail` (commit `2e4388a`). Applied live, 1 GB archive compressed.
3. **Plaintext (altbody) MIME part:** all campaigns were HTML-only (a spam-score
   signal; Listmonk only emits multipart/alternative when altbody is set). New
   `scripts/_email_plaintext.py` renders a text/plain part from the HTML body
   (preserves Listmonk template tags, links -> "text (url)"); wired into the
   trucking builder (and thus UCR + IFTA) and the healthcare builder. Tests:
   `scripts/test_email_plaintext.py`. Commits `a32a3b0`, `4664601`.
4. **`@localhost.localdomain` Message-IDs:** Listmonk derived the Message-ID
   from the random Docker container id. Pinned both listmonk + listmonk-hc
   `hostname: perfwest.performancewest.net` in `docker-compose.yml` (matches the
   SMTP `hello_hostname`). Commit `a32a3b0`.
5. **Dead/legacy/satellite ISP scrub:** added `DEAD_ISP_DOMAINS` (52 domains,
   identified from our own Listmonk bounce table) to `BLOCKED_EMAIL_DOMAINS` in
   `_email_exclusions.py`, so every builder that imports it stops cold-mailing
   them. Deliberately keeps still-active large consumer ISPs
   (comcast/charter/cox/centurylink); their bounces were the no-DKIM problem,
   not dead mailboxes. Commit `c183957`.
6. **`deploy@performancewest.net` self-bounce:** the deploy user's crontab held
   3 jobs (payment_reminder, amb_location_scraper, renewal_worker) that are
   EXACT duplicates of systemd timers in the `worker-crons` role AND redirected
   to `/var/log` (which deploy cannot write), so they failed and cron mailed the
   error to `deploy@` (no mailbox -> self-bounce). Removed the redundant deploy
   crontab (backed up to `logs/deploy-crontab.bak.*`); the systemd timers carry
   the work. No IaC change needed (Ansible never created that crontab).
7. **Entire campaign pipeline was not in IaC:** the campaign cron builders, IP
   warmup/ramp helpers, and bounce watchers lived ONLY on the host. New Ansible
   `mail-pipeline` role + `playbooks/deploy-mail-pipeline.yml` deploy them all
   from the canonical repo copies (`infra/cron/`, `infra/postfix/`,
   `infra/monitoring/`, `infra/systemd/`, `scripts/*bounce*`). Commit `4dc5690`.
8. **Telecom + transactional email was also HTML-only:** the campaign-builder
   plaintext fix (#3) only covered Listmonk mass-mail. The telecom/filing/
   customer-transactional path (499-Q reminders, RMD/FCC filing review links,
   intake/completion/delivery/commission emails, order confirmations) builds its
   own `MIMEMultipart` / nodemailer messages, and ~17 of them attached ONLY an
   HTML part, which is both a malformed single-part `multipart/alternative` and
   a spam signal. Fixed at the source so all callers are covered:
   - `scripts/workers/worker_email.py` `send_worker_email()` now auto-derives the
     text/plain part from HTML via `_email_plaintext.html_to_text` when the
     caller omits `text=`.
   - 16 hand-rolled Python senders (`scripts/workers/**`,
     `scripts/formation/document_delivery.py`) attach an `html_to_text(...)`
     plaintext sibling before the HTML part (`job_server` + `document_delivery`
     wrap text+html in an `alternative` sub-part so PDF/DOCX still attach to the
     `mixed` root).
   - `api/src/email.ts` gained a dependency-free `htmlToText()` and `sendEmail`
     now defaults `text` to it (covers checkout/webhook HTML-only sends).

   NB: telecom campaigns themselves are still **manually** created+sent in the
   Listmonk UI (no send automation; `compliance_alert_list.py` /
   `rmd_deficiency_campaign.py` only populate lists). The one telecom send to
   date (campaign 407 "FCC Deficiency Report - FREEDOM249", Jun 08) was
   HTML-only AND sent inside the DKIM-broken window: 384 sent / 343 views / **0
   clicks** (the same junked-mail signature as the trucking blasts). Any future
   telecom UI campaign should set an altbody (Listmonk "Plain text" toggle) and
   run through the same dead-ISP/suppression hygiene. Commit `b375385`.
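
To make the text+HTML pairing from items 3 and 8 concrete, here is a minimal standard-library sketch of a well-formed `multipart/alternative` message. It is not the repo's `_email_plaintext.html_to_text` or `send_worker_email()`: `html_to_text_sketch` is a simplified hypothetical stand-in that only handles links and a few block tags, and `build_message` is likewise a made-up helper.

```python
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, rendering links as 'text (url)'."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag in ("br", "p", "div"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.parts.append(f" ({self._href})")
            self._href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text_sketch(html: str) -> str:
    """Simplified stand-in for the real converter (hypothetical)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


def build_message(subject: str, sender: str, to: str, html: str) -> EmailMessage:
    """Build a message with the text/plain part first, then text/html."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = to
    msg.set_content(html_to_text_sketch(html))  # text/plain part
    msg.add_alternative(html, subtype="html")   # converts to multipart/alternative
    return msg
```

Putting the plain-text part before the HTML part follows the usual `multipart/alternative` ordering, where the richest representation comes last.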