docs: record post-incident email hardening (7 fixes) in runbook

justin 2026-06-17 20:30:59 -05:00
parent 466460112b
commit 4171f48736


@@ -164,10 +164,48 @@ sudo opendkim-testkey -d performancewest.net -s mail -vvv # expect "key OK"
```
**Still TODO from this incident (list quality + content, not yet done):**
- Scrub dead rural/satellite ISPs + dead M365 tenants from audiences and
suppress repeat-deferring/bouncing domains (extend `_email_exclusions.py`).
- Throttle/pause Gmail until reputation recovers (`550-5.7.1` was still firing).
- Add a plaintext (altbody) MIME part — all campaigns are currently HTML-only,
itself a spam signal.
- Fix the self-bounce cron emailing the nonexistent `deploy@performancewest.net`
(~700 self-inflicted `550` bounces/day).
The trucking ramp/cap (`pw-listmonk-rampcap`) currently holds at 200/h and the
builder excludes the big-MX operators (Google/Microsoft/...) until warmup
day 30; revisit once reputation recovers.
- Dead M365 tenant scrub: HC defers are mostly `451 4.4.4` against dead M365
tenants + `421` LuxSci throttle. Identify and suppress dead tenants.
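
One way to shortlist candidates for the dead-tenant scrub is to count `451 4.4.4` deferrals per recipient domain in the Postfix log. The sketch below is illustrative only and does not show the contents of `_email_exclusions.py`. The function name `repeat_deferred_domains` and the `min_hits` threshold are assumptions; the regex relies on the standard Postfix delivery-log fields `to=<...>`, `dsn=`, and `status=`.

```python
import re
from collections import Counter

# Recipient domain, DSN code, and delivery status from a Postfix smtp log line.
LINE_RE = re.compile(
    r"to=<[^>@]+@(?P<domain>[^>]+)>.*?\bdsn=(?P<dsn>\d\.\d{1,3}\.\d{1,3}), "
    r"status=(?P<status>\w+)"
)


def repeat_deferred_domains(lines, dsn="4.4.4", min_hits=3):
    """Return recipient domains deferred with `dsn` at least `min_hits` times.

    `min_hits` is a hypothetical threshold; tune it against real log volume.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status") == "deferred" and match.group("dsn") == dsn:
            counts[match.group("domain").lower()] += 1
    return {domain for domain, hits in counts.items() if hits >= min_hits}
```

The returned set is only a shortlist. A domain that defers during a transient outage looks identical to a dead tenant in a single day's log, so the list should be reviewed before anything is added to the suppression list.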
### Follow-up hardening — DONE (Jun 17-18 2026)
All discovered during the post-incident technical audit; each fix is codified.
1. **OpenDKIM not signing** — fixed + codified in Ansible `roles/mail`
(commit `4d59019`). Foundational fix above.
2. **`mail.log` unbounded (~1 GB, no logrotate)** — this host logs via Postfix's
built-in `postlogd` (no rsyslog), so a rename+create rotation would leave
`postlogd` writing to the renamed file's still-open inode. Added a
`copytruncate` logrotate rule (daily, 14-day retention, compressed) to
`roles/mail` (commit `2e4388a`). Applied live; the ~1 GB archive is now compressed.
3. **Plaintext (altbody) MIME part** — all campaigns were HTML-only (a spam-score
signal; Listmonk only emits multipart/alternative when altbody is set). New
`scripts/_email_plaintext.py` renders a text/plain part from the HTML body
(preserves Listmonk template tags, links -> "text (url)"); wired into the
trucking builder (and thus UCR + IFTA) and the healthcare builder. Tests:
`scripts/test_email_plaintext.py`. Commits `a32a3b0`, `4664601`.
4. **`@localhost.localdomain` Message-IDs** — Listmonk derived the Message-ID
from the random Docker container ID. Pinned
`hostname: perfwest.performancewest.net` on both the `listmonk` and
`listmonk-hc` services in `docker-compose.yml`, matching the SMTP
`hello_hostname`. Commit `a32a3b0`.
5. **Dead/legacy/satellite ISP scrub** — added `DEAD_ISP_DOMAINS` (52 domains,
identified from our own Listmonk bounce table) to `BLOCKED_EMAIL_DOMAINS` in
`_email_exclusions.py`, so every builder that imports it stops cold-mailing
them. Deliberately keeps still-active large consumer ISPs (comcast/charter/
cox/centurylink) — their bounces were the no-DKIM problem, not dead mailboxes.
Commit `c183957`.
6. **`deploy@performancewest.net` self-bounce** — the deploy user's crontab held
three jobs (`payment_reminder`, `amb_location_scraper`, `renewal_worker`) that
exactly duplicated systemd timers in the `worker-crons` role. Each job also
redirected its output to `/var/log`, which the deploy user cannot write, so
every run failed and cron mailed the error to `deploy@`, which has no mailbox
(hence the self-bounce). Removed the redundant deploy crontab (backed up to
`logs/deploy-crontab.bak.*`); the systemd timers carry the work. No IaC change
needed (Ansible never created that crontab).
7. **Entire campaign pipeline was not in IaC** — the campaign cron builders, IP
warmup/ramp helpers, and bounce watchers lived ONLY on the host. New Ansible
`mail-pipeline` role + `playbooks/deploy-mail-pipeline.yml` deploy them all
from the canonical repo copies (`infra/cron/`, `infra/postfix/`,
`infra/monitoring/`, `infra/systemd/`, `scripts/*bounce*`). Commit `4dc5690`.
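
The plaintext rendering described in item 3 can be sketched with the standard-library `html.parser`. This is a minimal illustration under stated assumptions, not the actual `scripts/_email_plaintext.py`: the names `_TextRenderer` and `html_to_plaintext` are hypothetical, and the set of block-level tags is a guess at what a campaign body typically contains. Listmonk template expressions such as `{{ .Subscriber.FirstName }}` pass through untouched because the parser treats them as ordinary text.

```python
from html.parser import HTMLParser


class _TextRenderer(HTMLParser):
    """Collect visible text and rewrite <a href="url">text</a> as 'text (url)'."""

    # Hypothetical choice of tags that should start a new line in the text part.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            self.parts.append(f" ({self.href})")
            self.href = None
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_plaintext(html):
    """Render a text/plain body from an HTML body, one block per line."""
    renderer = _TextRenderer()
    renderer.feed(html)
    renderer.close()
    lines = (" ".join(line.split()) for line in "".join(renderer.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    body = (
        "<p>Hi {{ .Subscriber.FirstName }},</p>"
        '<p>Renew <a href="https://example.com/renew">here</a>.</p>'
        '<p><a href="{{ UnsubscribeURL }}">Unsubscribe</a></p>'
    )
    print(html_to_plaintext(body))
```

A template expression that itself contains double quotes inside an `href` attribute would need extra handling that this sketch does not attempt.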