docs: email deliverability incident timeline (May-June 2026)

justin 2026-06-27 15:10:32 -05:00
parent 6b2cf5a07b
commit f163ccea92


@@ -0,0 +1,131 @@
# Email Deliverability - Incident & Issue Timeline (May-June 2026)
Listmonk + Postfix (self-hosted MTA) cold-outreach for trucking (IP .94) and
healthcare (IP .107) on host 207.174.124.71. Dates are when the issue was
identified/fixed; root causes often predate the fix.
---
## May
- **2026-05-21** - Rebuilt the Listmonk bounce-sync after unreliable webhook
delivery (Listmonk silently drops bounces it can't FK-match to a subscriber).
Switched to log-scraping `/var/log/mail.log` and inserting with real
subscriber IDs. (commit `ba2f6eb`)
- **2026-05-30** - ⭐ **The DKIM disaster blast.** A large trucking blast went
out with **broken DKIM signing**, so receivers rejected it under DMARC
policy. **7,634 "hard bounces" in one day** - but ~6,604 were DSN `5.7.1`
(DMARC/policy failures, *not* bad mailboxes); only ~221 were real dead
mailboxes (`5.1.1`). This is the event that poisoned the carrier list.
- **2026-05-31** → following weeks - Fallout: Listmonk auto-blocklisted on the
**first** hard bounce, and the bounce-sync's own SQL also blocklisted on the
first 5xx of *any* DSN. Result: **~17,000 carriers wrongly blocklisted** (88%
of the list) over the broken-DKIM window. Not discovered as a false-positive
until late June.
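
The split between policy rejections (`5.7.1`) and dead mailboxes (`5.1.1`) can be read straight out of `mail.log`. The following is a minimal sketch, not the code from `ba2f6eb`: it assumes the stock Postfix delivery-log shape (`dsn=X.Y.Z, status=...`), the category grouping is illustrative, and the sample lines and addresses are made up.

```python
import re
from collections import Counter

# Postfix records each delivery result as "dsn=X.Y.Z, status=<word>".
DSN_RE = re.compile(r"\bdsn=(\d\.\d{1,3}\.\d{1,3}), status=(\w+)")

# Illustrative grouping (an assumption, not the production list).
BAD_MAILBOX = {"5.1.1", "5.1.10"}


def classify(dsn: str) -> str:
    """Bucket an enhanced status code into a coarse bounce category."""
    if dsn in BAD_MAILBOX:
        return "bad_mailbox"
    if dsn.startswith("5.7."):
        return "policy"  # auth / DMARC / reputation rejections
    return "other"


def tally_bounces(lines):
    """Count bounced deliveries per category from mail.log lines."""
    counts = Counter()
    for line in lines:
        m = DSN_RE.search(line)
        if m and m.group(2) == "bounced":
            counts[classify(m.group(1))] += 1
    return counts


# Made-up sample lines in the usual Postfix format.
sample = [
    "May 30 10:00:01 mx postfix/smtp[1111]: A1: to=<a@example.com>, "
    "relay=mx.example.com[192.0.2.1]:25, delay=1.2, dsn=5.7.1, "
    "status=bounced (550 5.7.1 rejected by DMARC policy)",
    "May 30 10:00:02 mx postfix/smtp[1111]: A2: to=<b@example.com>, "
    "relay=mx.example.com[192.0.2.1]:25, delay=0.9, dsn=5.7.1, "
    "status=bounced (550 5.7.1 unauthenticated mail)",
    "May 30 10:00:03 mx postfix/smtp[1111]: A3: to=<c@example.org>, "
    "relay=mx.example.org[192.0.2.2]:25, delay=0.8, dsn=5.1.1, "
    "status=bounced (550 5.1.1 user unknown)",
    "May 30 10:00:04 mx postfix/smtp[1111]: A4: to=<d@example.org>, "
    "relay=mx.example.org[192.0.2.2]:25, delay=0.7, dsn=2.0.0, "
    "status=sent (250 ok)",
]

print(tally_bounces(sample))
```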
---
## June - root-cause fixes to the sending stack
- **2026-06-14** - Per-MX-operator throttling added; Google (Workspace) /
Microsoft 365 excluded from warmup sends. HC warmup corrected to run **daily**
for the full 21-day ramp (was weekdays-only, stretching the ramp). (`9e40965`,
`2caab6a`)
- **2026-06-16** - Stopped blasting trucking to `mx_unreachable` dead domains;
the verifier was mislabeling live big-ISP mailboxes as unreachable. Suppressed
defunct/legacy/satellite ISP domains in cold sends. (`1652a3b`, `1eb29f8`,
`c183957`)
- **2026-06-17** - ⭐ **Root DKIM fix.** Found OpenDKIM was **not signing**
campaign mail (the Docker-injected path bypassed signing); fixed and codified
in Ansible. Also: added a `text/plain` MIME part to every email (spam-filter
requirement), stable Message-ID hostname, Postfix `mail.log` logrotate,
decommissioned SMTP2GO (local MTA only). (`4d59019`, `a32a3b0`, `b375385`,
`2e4388a`, `a04ecf7`)
- **2026-06-18** - ⭐ Moved bulk campaigns to a **dedicated subdomain**
`send.performancewest.net` (protects the root domain's reputation); Ansible
signs it. **Killed the snowshoe IP pattern** now that DKIM works (consolidated
sending IPs). Excluded Apple/iCloud consumer mail; began scrubbing stale
consumer subscribers from Listmonk. Catch-all pool auto-rollout gated by
warmup-day + live bounce rate. (`5c3b429`, `545e6f7`, `b40fc7e`, `40da017`)
- **2026-06-19** - Removed 18 dormant **snowshoe IPs** from Postfix + host.
Built a **mail-reputation monitor** (SNDS-equivalent from Postfix logs) +
nightly snapshot cron. Stood up **DMARC aggregate-report ingestion** (dedicated
`dmarc@` mailbox + parser); classified the whole `207.174.124.0/24` as ours.
(`9dd6f53`, `08f651d`, `b45332b`, `8e5590b`, `707d538`)
- **2026-06-20** - Bounded the untagged (NULL `mx_provider`) bucket in the
selector and closed **MX-exclusion gaps** (consumer MX operators were leaking
into cold sends); added an MX-tagging cron. (`9eeed47`, `bc93d93`)
- **2026-06-21** - Fixed the **Reply-To header shape** - Listmonk was silently
dropping a malformed Reply-To. (`e414ec4`)
- **2026-06-22** - ⭐ **Post-DKIM re-send** to the list, with a **Gmail-only
exclusion** (Gmail still distrusted the warming domain). Stepped the trucking
rate cap back up to 400/h (day 19-20), 500/h ceiling. (`5a3063e`, `1e9dcfc`)
- **2026-06-22/23** - Fixed broken CTAs in trucking email: a recurring
`@TrackLink` **404** + link-collapse bug, and order CTAs pointing at the wrong
($399 catch-all) service page. (`3325259`, `e3f4392`, `a90cdc9`)
- **2026-06-24** - ⭐ **Sending-IP outage.** The warmed sending IPs **dropped
off interface `ens18` on reboot**, so mail stopped/misrouted. Fixed to persist
across reboots. Also repaired two dead mail-alert crons + de-noised the DMARC
digest. (`4276ada`, `ae68edb`)
- **2026-06-26** - ⭐ **Volume whipsaw fixed.** The catch-all guardrail used a
**2-day** bounce window; one bad batch (Jun 24: 465 sent, 10.75% bounced) flipped
catch-all OFF, starving volume so badly it couldn't gather a 300-send sample to
re-enable - a self-reinforcing trap. Widened the window **2d → 5d**. Also fixed
the HC cron **re-mailing the whole list daily** (added per-day send lists).
(`f344287`, `b350a13`)
- **2026-06-26** - ⭐⭐ **The re-blocklist bomb.** Discovered `listmonk-bounce-sync`
(root cron, every 5 min) was blocklisting carriers on the **first hard bounce
of *any* 5xx DSN** via direct SQL - bypassing Listmonk's own threshold. *This*
is the mechanism that wrongly killed ~17,000 carriers in May. Rewrote it: only
genuine bad-mailbox DSNs (5.1.1/5.1.10/5.1.0/5.0.0/5.4.1/5.5.0) count, and it
now requires **≥3 distinct hard bounces**. Reputation/policy 5.7.x and
quota/greylist 5.2.x never trigger a blocklist. (`bfdbf8f`)
- **2026-06-27** - ⭐ **Wrongly-blocklisted recovery send (campaign 727).**
Un-blocklisted 4,317 false-positive carriers (excluding the ~688 real dead
mailboxes), re-sent with a fresh 30%-off coupon. Verified the bounce-sync fix
held live: campaign 727 took ~61 hard bounces but **0 carriers re-blocklisted**.
- **2026-06-27** - ⭐⭐⭐ **Disk-full Postgres crash, mid-send.** `/` hit **100%**
(orphaned 15GB forgejo backup dump + uncapped Docker logs), Postgres
crash-looped on "No space left on device", and the Listmonk container was
destroyed mid-campaign. Recovered (pruned build cache + dumps + orphan volumes:
100% → 72%, 62GB free), recreated Listmonk, campaigns auto-resumed. Added a
**Docker log cap** (50m×3) and a **disk-space monitor** (Telegram warn at 90%,
auto-reclaim at 94%) - neither existed before. (`e318f12`, `6b2cf5a`)
- **2026-06-27** - ⭐ **/24 RBL listing.** The whole `207.174.124.0/24` block
got listed on **invaluement** (ivmSIP + ivmSIP/24) - affects ~11% of
recipients (Intermedia/securence business domains); **Spamhaus / Barracuda /
SpamCop all clean**, so Gmail/Microsoft/Yahoo unaffected. Dialed catch-all back
to smtp_valid-only and submitted a delist request (propagation pending). Also
noted ~73 "very low reputation" rejects are **Google-Workspace custom domains**
the `@gmail.com` filter misses.
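
The rewritten blocklist rule from the 2026-06-26 bounce-sync entry (only bad-mailbox DSNs count, and only after 3 or more distinct hard bounces) can be sketched as below. This is an illustration, not the SQL in `bfdbf8f`: the `Bounce` type and function name are hypothetical, and "distinct" is assumed to mean separate delivery attempts identified by Postfix queue ID.

```python
from collections import defaultdict
from dataclasses import dataclass

# DSN codes treated as genuine bad-mailbox signals, per the timeline entry.
BAD_MAILBOX_DSNS = frozenset(
    {"5.1.1", "5.1.10", "5.1.0", "5.0.0", "5.4.1", "5.5.0"}
)
MIN_DISTINCT_BOUNCES = 3


@dataclass(frozen=True)
class Bounce:
    email: str
    dsn: str
    queue_id: str  # assumption: one queue ID per delivery attempt


def emails_to_blocklist(bounces):
    """Return addresses with enough distinct bad-mailbox bounces."""
    attempts = defaultdict(set)
    for b in bounces:
        # 5.7.x (policy) and 5.2.x (quota) are not in the set, so they
        # never contribute toward a blocklist decision.
        if b.dsn in BAD_MAILBOX_DSNS:
            attempts[b.email.lower()].add(b.queue_id)
    return sorted(
        email
        for email, ids in attempts.items()
        if len(ids) >= MIN_DISTINCT_BOUNCES
    )


# Made-up events for illustration.
events = [
    Bounce("dead@example.com", "5.1.1", "Q1"),
    Bounce("dead@example.com", "5.1.1", "Q2"),
    Bounce("dead@example.com", "5.1.1", "Q3"),
    Bounce("policy@example.com", "5.7.1", "Q4"),
    Bounce("policy@example.com", "5.7.1", "Q5"),
    Bounce("policy@example.com", "5.7.1", "Q6"),
    Bounce("once@example.com", "5.1.1", "Q7"),
    Bounce("once@example.com", "5.1.1", "Q7"),  # duplicate log line
]

print(emails_to_blocklist(events))
```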
---
## Cross-cutting root causes (the recurring themes)
1. **DKIM not signing (May)** → DMARC rejections misread as hard bounces →
mass false-positive blocklisting. The single most damaging issue.
2. **Over-aggressive blocklisting logic** (both Listmonk's count:1 default *and*
the bounce-sync's own SQL) turned transient/policy bounces into permanent
list death.
3. **Reputation/warmup fragility** - snowshoe IPs, missing dedicated subdomain,
consumer-domain leakage, big-MX (Google/MS) sensitivity during warmup.
4. **Operational guardrails too twitchy or missing** - 2-day catch-all window
whipsaw; no disk monitoring; uncapped Docker logs; IPs dropping off NIC on
reboot; dead alert crons.
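
The catch-all whipsaw in theme 4 is easier to see with numbers. The sketch below is a simplified model, not the code from `f344287`: the 5% bounce threshold is a hypothetical value (the timeline does not state it), the 300-send minimum comes from the 2026-06-26 entry, and every daily figure is made up except Jun 24, where 50 bounces is back-derived from 465 sent at 10.75%.

```python
def catch_all_enabled(daily, window_days, currently_enabled,
                      max_bounce_rate=0.05, min_sample=300):
    """Decide whether catch-all sending should be on.

    daily: list of (sent, bounced) tuples, oldest first.
    max_bounce_rate is a hypothetical threshold for illustration.
    """
    recent = daily[-window_days:]
    sent = sum(s for s, _ in recent)
    bounced = sum(b for _, b in recent)
    if sent < min_sample:
        # Too little data to judge: hold the current state.
        return currently_enabled
    return bounced / sent <= max_bounce_rate


# Illustrative history; the last entry is the Jun 24 bad batch.
history = [(400, 6), (420, 6), (450, 7), (465, 50)]

# 2-day window: the bad batch dominates and catch-all flips off.
print(catch_all_enabled(history, 2, currently_enabled=True))   # False

# Two days later nothing has been sent, so the 2-day window holds no
# sample and the guardrail stays stuck in the off state.
stalled = history + [(0, 0), (0, 0)]
print(catch_all_enabled(stalled, 2, currently_enabled=False))  # False

# 5-day window: earlier healthy days dilute the bad batch.
print(catch_all_enabled(stalled, 5, currently_enabled=False))  # True
```

With the wider window the guardrail still reacts to a sustained bounce problem, but a single bad batch no longer removes the sample it needs to recover.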