From 4171f48736f650125f3388b884f94ce63dba1f32 Mon Sep 17 00:00:00 2001
From: justin
Date: Wed, 17 Jun 2026 20:30:59 -0500
Subject: [PATCH] docs: record post-incident email hardening (7 fixes) in runbook

---
 docs/email-deliverability-runbook.md | 50 ++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/docs/email-deliverability-runbook.md b/docs/email-deliverability-runbook.md
index 843f1ae..e65ac6b 100644
--- a/docs/email-deliverability-runbook.md
+++ b/docs/email-deliverability-runbook.md
@@ -164,10 +164,48 @@ sudo opendkim-testkey -d performancewest.net -s mail -vvv # expect "key OK"
 ```
 
 **Still TODO from this incident (list quality + content, not yet done):**
-- Scrub dead rural/satellite ISPs + dead M365 tenants from audiences and
-  suppress repeat-deferring/bouncing domains (extend `_email_exclusions.py`).
 - Throttle/pause Gmail until reputation recovers (`550-5.7.1` was still firing).
-- Add a plaintext (altbody) MIME part — all campaigns are currently HTML-only,
-  itself a spam signal.
-- Fix the self-bounce cron emailing the nonexistent `deploy@performancewest.net`
-  (~700 self-inflicted `550` bounces/day).
+  The trucking ramp/cap (`pw-listmonk-rampcap`) currently holds at 200/h and the
+  builder excludes the big-MX operators (Google/Microsoft/...) until warmup
+  day 30; revisit once reputation recovers.
+- Dead M365 tenant scrub: HC defers are mostly `451 4.4.4` against dead M365
+  tenants + `421` LuxSci throttle. Identify and suppress dead tenants.
+
+### Follow-up hardening — DONE (Jun 17-18 2026)
+
+All discovered during the post-incident technical audit; each fix is codified.
+
+1. **OpenDKIM not signing** — fixed + codified in Ansible `roles/mail`
+   (commit `4d59019`). Foundational fix above.
+2. **`mail.log` unbounded (~1 GB, no logrotate)** — this host logs via Postfix's
+   built-in `postlogd` (no rsyslog), so a rename+create would strand the open
+   inode. Added a `copytruncate` logrotate rule (daily, 14-day, compressed) to
+   `roles/mail` (commit `2e4388a`). Applied live, 1 GB archive compressed.
+3. **Plaintext (altbody) MIME part** — all campaigns were HTML-only (a spam-score
+   signal; Listmonk only emits multipart/alternative when altbody is set). New
+   `scripts/_email_plaintext.py` renders a text/plain part from the HTML body
+   (preserves Listmonk template tags, links -> "text (url)"); wired into the
+   trucking builder (and thus UCR + IFTA) and the healthcare builder. Tests:
+   `scripts/test_email_plaintext.py`. Commits `a32a3b0`, `4664601`.
+4. **`@localhost.localdomain` Message-IDs** — Listmonk derived the Message-ID
+   from the random Docker container id. Pinned both listmonk + listmonk-hc
+   `hostname: perfwest.performancewest.net` in `docker-compose.yml` (matches the
+   SMTP `hello_hostname`). Commit `a32a3b0`.
+5. **Dead/legacy/satellite ISP scrub** — added `DEAD_ISP_DOMAINS` (52 domains,
+   identified from our own Listmonk bounce table) to `BLOCKED_EMAIL_DOMAINS` in
+   `_email_exclusions.py`, so every builder that imports it stops cold-mailing
+   them. Deliberately keeps still-active large consumer ISPs (comcast/charter/
+   cox/centurylink) — their bounces were the no-DKIM problem, not dead mailboxes.
+   Commit `c183957`.
+6. **`deploy@performancewest.net` self-bounce** — the deploy user's crontab held
+   3 jobs (payment_reminder, amb_location_scraper, renewal_worker) that are
+   EXACT duplicates of systemd timers in the `worker-crons` role AND redirected
+   to `/var/log` (which deploy cannot write), so they failed and cron mailed the
+   error to `deploy@` (no mailbox -> self-bounce). Removed the redundant deploy
+   crontab (backed up to `logs/deploy-crontab.bak.*`); the systemd timers carry
+   the work. No IaC change needed (Ansible never created that crontab).
+7. **Entire campaign pipeline was not in IaC** — the campaign cron builders, IP
+   warmup/ramp helpers, and bounce watchers lived ONLY on the host. New Ansible
+   `mail-pipeline` role + `playbooks/deploy-mail-pipeline.yml` deploy them all
+   from the canonical repo copies (`infra/cron/`, `infra/postfix/`,
+   `infra/monitoring/`, `infra/systemd/`, `scripts/*bounce*`). Commit `4dc5690`.
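
Note on item 3 of the patch above: the runbook entry says a text/plain part is rendered from the HTML body, that Listmonk template tags are preserved, and that links become "text (url)". The following is only a minimal sketch of that idea using the Python standard library. It is not the contents of `scripts/_email_plaintext.py`, which this patch does not show; the names `html_to_plaintext` and `_PlaintextRenderer`, and the sets of block and skipped tags, are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets for this sketch; the real script may differ.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style", "head"}


class _PlaintextRenderer(HTMLParser):
    """Collects text, turning <a href=URL>text</a> into 'text (URL)'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href = None
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.href = dict(attrs).get("href")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            if self.href:
                self.parts.append(f" ({self.href})")
            self.href = None
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Template tags such as {{ .Subscriber.FirstName }} arrive here as
        # ordinary text, so they pass through unchanged.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_plaintext(html: str) -> str:
    """Render a plaintext alternative from an HTML campaign body."""
    renderer = _PlaintextRenderer()
    renderer.feed(html)
    renderer.close()
    text = "".join(renderer.parts)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line
    return text.strip()
```

A builder could pass the result as the campaign's altbody so that Listmonk emits multipart/alternative, as the runbook entry describes. Because template tags are treated as plain text, a link such as `<a href="{{ UnsubscribeURL }}">Unsubscribe</a>` would come out as `Unsubscribe ({{ UnsubscribeURL }})` and still be expanded at send time.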