docs: record post-incident email hardening (7 fixes) in runbook
parent 466460112b
commit 4171f48736
1 changed file with 44 additions and 6 deletions
@@ -164,10 +164,48 @@ sudo opendkim-testkey -d performancewest.net -s mail -vvv # expect "key OK"
```
**Still TODO from this incident (list quality + content, not yet done):**
- Scrub dead rural/satellite ISPs + dead M365 tenants from audiences and
  suppress repeat-deferring/bouncing domains (extend `_email_exclusions.py`).
- Throttle/pause Gmail until reputation recovers (`550-5.7.1` was still firing).
- Add a plaintext (altbody) MIME part: all campaigns are currently HTML-only,
  itself a spam signal.
- Fix the self-bounce cron emailing the nonexistent `deploy@performancewest.net`
  (~700 self-inflicted `550` bounces/day).
The trucking ramp/cap (`pw-listmonk-rampcap`) currently holds at 200/h and the
builder excludes the big-MX operators (Google/Microsoft/...) until warmup
day 30; revisit once reputation recovers.
- Dead M365 tenant scrub: HC defers are mostly `451 4.4.4` against dead M365
  tenants + `421` LuxSci throttle. Identify and suppress dead tenants.
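
One way to approach the "identify and suppress" step is sketched below. It is not the code in `_email_exclusions.py`. It assumes Postfix-style delivery lines carrying `to=<...>`, `dsn=4.4.4`, and `status=deferred`. The function names, the `min_defers` threshold, and the sample log lines are all hypothetical.

```python
import re
from collections import Counter

# Matches a Postfix-style delivery line deferred with DSN 4.4.4 and captures
# the recipient domain. The field layout is an assumption about the log format.
DEFER_RE = re.compile(
    r"to=<[^>@\s]+@([^>\s]+)>.*\bdsn=4\.4\.4\b.*\bstatus=deferred\b"
)


def repeat_deferring_domains(log_lines, min_defers=3):
    """Return domains that were deferred with 4.4.4 at least min_defers times."""
    counts = Counter()
    for line in log_lines:
        match = DEFER_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return {domain for domain, n in counts.items() if n >= min_defers}


def filter_recipients(addresses, blocked_domains):
    """Drop addresses whose domain is in blocked_domains (case-insensitive)."""
    return [
        addr for addr in addresses
        if addr.rsplit("@", 1)[-1].lower() not in blocked_domains
    ]


# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    "Jun 17 10:00:01 mx postfix/smtp[101]: A1: to=<a@dead-tenant.example>, "
    "relay=none, delay=5, dsn=4.4.4, status=deferred (451 4.4.4 sample)",
    "Jun 17 10:05:01 mx postfix/smtp[102]: A2: to=<b@Dead-Tenant.example>, "
    "relay=none, delay=9, dsn=4.4.4, status=deferred (451 4.4.4 sample)",
    "Jun 17 10:09:01 mx postfix/smtp[103]: A3: to=<c@dead-tenant.example>, "
    "relay=none, delay=7, dsn=4.4.4, status=deferred (451 4.4.4 sample)",
    "Jun 17 10:10:01 mx postfix/smtp[104]: A4: to=<d@slow.example>, "
    "relay=none, delay=3, dsn=4.7.0, status=deferred (421 throttled sample)",
    "Jun 17 10:11:01 mx postfix/smtp[105]: A5: to=<e@ok.example>, "
    "relay=mx.ok.example, delay=1, dsn=2.0.0, status=sent (250 ok)",
]

if __name__ == "__main__":
    dead = repeat_deferring_domains(SAMPLE_LOG)
    print(sorted(dead))
    print(filter_recipients(["a@dead-tenant.example", "e@ok.example"], dead))
```

Counting only `4.4.4` defers keeps throttling responses such as `421` out of the suppression set, since those domains are slow rather than dead.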
### Follow-up hardening: DONE (Jun 17-18 2026)

All discovered during the post-incident technical audit; each fix is codified.

1. **OpenDKIM not signing**: fixed + codified in Ansible `roles/mail`
   (commit `4d59019`). This is the foundational fix described above.
2. **`mail.log` unbounded (~1 GB, no logrotate)**: this host logs via Postfix's
   built-in `postlogd` (no rsyslog), so a rename+create would strand the open
   inode. Added a `copytruncate` logrotate rule (daily, 14-day, compressed) to
   `roles/mail` (commit `2e4388a`). Applied live; the 1 GB archive was compressed.
3. **Plaintext (altbody) MIME part**: all campaigns were HTML-only (a spam-score
   signal; Listmonk only emits multipart/alternative when altbody is set). New
   `scripts/_email_plaintext.py` renders a text/plain part from the HTML body
   (preserves Listmonk template tags, links -> "text (url)"); wired into the
   trucking builder (and thus UCR + IFTA) and the healthcare builder. Tests:
   `scripts/test_email_plaintext.py`. Commits `a32a3b0`, `4664601`.
4. **`@localhost.localdomain` Message-IDs**: Listmonk derived the Message-ID
   from the random Docker container id. Pinned both listmonk + listmonk-hc
   `hostname: perfwest.performancewest.net` in `docker-compose.yml` (matches the
   SMTP `hello_hostname`). Commit `a32a3b0`.
5. **Dead/legacy/satellite ISP scrub**: added `DEAD_ISP_DOMAINS` (52 domains,
   identified from our own Listmonk bounce table) to `BLOCKED_EMAIL_DOMAINS` in
   `_email_exclusions.py`, so every builder that imports it stops cold-mailing
   them. Deliberately keeps still-active large consumer ISPs
   (comcast/charter/cox/centurylink); their bounces were the no-DKIM problem,
   not dead mailboxes. Commit `c183957`.
6. **`deploy@performancewest.net` self-bounce**: the deploy user's crontab held
   3 jobs (payment_reminder, amb_location_scraper, renewal_worker) that were
   exact duplicates of systemd timers in the `worker-crons` role and redirected
   their output to `/var/log`, which deploy cannot write to. Each run therefore
   failed, and cron mailed the error to `deploy@` (no mailbox -> self-bounce).
   Removed the redundant deploy crontab (backed up to
   `logs/deploy-crontab.bak.*`); the systemd timers carry the work. No IaC
   change needed (Ansible never created that crontab).
7. **Entire campaign pipeline was not in IaC**: the campaign cron builders, IP
   warmup/ramp helpers, and bounce watchers lived ONLY on the host. New Ansible
   `mail-pipeline` role + `playbooks/deploy-mail-pipeline.yml` deploy them all
   from the canonical repo copies (`infra/cron/`, `infra/postfix/`,
   `infra/monitoring/`, `infra/systemd/`, `scripts/*bounce*`). Commit `4dc5690`.
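
To illustrate the plaintext rendering described in item 3, here is a minimal sketch built on the standard-library `html.parser`. It is not the contents of `scripts/_email_plaintext.py`. The class name, the `html_to_text` function, the block-tag set, and the sample HTML are assumptions made for this example.

```python
from html.parser import HTMLParser


class _TextRenderer(HTMLParser):
    """Collects text, rewrites links as 'text (url)', breaks on block tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Remember the target so it can be appended after the link text.
            self.href = dict(attrs).get("href")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a":
            if self.href:
                self.parts.append(f" ({self.href})")
            self.href = None
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Template tags such as {{ ... }} are ordinary text to the parser,
        # so they pass through untouched.
        self.parts.append(data)


def html_to_text(html):
    """Render an HTML body as plain text, one non-empty line per block."""
    renderer = _TextRenderer()
    renderer.feed(html)
    renderer.close()
    lines = [" ".join(line.split()) for line in "".join(renderer.parts).splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = (
        "<p>Hi {{ .Subscriber.FirstName }},</p>"
        '<p>Renew <a href="https://example.com/renew">here</a>.</p>'
        '<p><a href="{{ UnsubscribeURL }}">Unsubscribe</a></p>'
    )
    print(html_to_text(sample))
```

Because the parser treats `{{ ... }}` as plain text or as an ordinary attribute value, template expressions survive in both link text and `href`, so the template engine can still expand them in the text/plain part.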