From f163ccea92e5189eede2380e32bae1212ac3c292 Mon Sep 17 00:00:00 2001
From: justin
Date: Sat, 27 Jun 2026 15:10:32 -0500
Subject: [PATCH] docs: email deliverability incident timeline (May-June 2026)

---
 .../email-deliverability-incident-timeline.md | 131 ++++++++++++++++++
 1 file changed, 131 insertions(+)
 create mode 100644 docs/email-deliverability-incident-timeline.md

diff --git a/docs/email-deliverability-incident-timeline.md b/docs/email-deliverability-incident-timeline.md
new file mode 100644
index 0000000..e8c8feb
--- /dev/null
+++ b/docs/email-deliverability-incident-timeline.md
@@ -0,0 +1,131 @@
+# Email Deliverability - Incident & Issue Timeline (May–June 2026)
+
+Listmonk + Postfix (self-hosted MTA) cold-outreach for trucking (IP .94) and
+healthcare (IP .107) on host 207.174.124.71. Dates are when the issue was
+identified/fixed; root causes often predate the fix.
+
+---
+
+## May
+
+- **2026-05-21** - Rebuilt the Listmonk bounce-sync after unreliable webhook
+  delivery (Listmonk silently drops bounces it can't FK-match to a subscriber).
+  Switched to log-scraping `/var/log/mail.log` and inserting with real
+  subscriber IDs. (commit `ba2f6eb`)
+
+- **2026-05-30** - ⭐ **The DKIM disaster blast.** A large trucking blast went
+  out with **broken DKIM signing**, so receivers applied DMARC auth-policy
+  rejection. **7,634 "hard bounces" in one day** - but ~6,604 were DSN `5.7.1`
+  (DMARC/policy failures, *not* bad mailboxes); only ~221 were real dead
+  mailboxes (`5.1.1`). This is the event that poisoned the carrier list.
+
+- **2026-05-31** → following weeks - Fallout: Listmonk auto-blocklisted on the
+  **first** hard bounce, and the bounce-sync's own SQL also blocklisted on the
+  first 5xx of *any* DSN. Result: **~17,000 carriers wrongly blocklisted** (88%
+  of the list) over the broken-DKIM window. Not discovered as a false-positive
+  until late June.
+
+---
+
+## June - root-cause fixes to the sending stack
+
+- **2026-06-14** - Per-MX-operator throttling added; Google / Microsoft 365
+  (Workspace) excluded from warmup sends. HC warmup corrected to run **daily**
+  for the full 21-day ramp (was weekdays-only, stretching the ramp). (`9e40965`,
+  `2caab6a`)
+
+- **2026-06-16** - Stopped blasting trucking to `mx_unreachable` dead domains;
+  the verifier was mislabeling live big-ISP mailboxes as unreachable. Suppressed
+  defunct/legacy/satellite ISP domains in cold sends. (`1652a3b`, `1eb29f8`,
+  `c183957`)
+
+- **2026-06-17** - ⭐ **Root DKIM fix.** Found OpenDKIM was **not signing**
+  campaign mail (the Docker-injected path bypassed signing); fixed and codified
+  in Ansible. Also: added a `text/plain` MIME part to every email (spam-filter
+  requirement), stable Message-ID hostname, Postfix `mail.log` logrotate,
+  decommissioned SMTP2GO (local MTA only). (`4d59019`, `a32a3b0`, `b375385`,
+  `2e4388a`, `a04ecf7`)
+
+- **2026-06-18** - ⭐ Moved bulk campaigns to a **dedicated subdomain**
+  `send.performancewest.net` (protects the root domain's reputation); Ansible
+  signs it. **Killed the snowshoe IP pattern** now that DKIM works (consolidated
+  sending IPs). Excluded Apple/iCloud consumer mail; began scrubbing stale
+  consumer subscribers from Listmonk. Catch-all pool auto-rollout gated by
+  warmup-day + live bounce rate. (`5c3b429`, `545e6f7`, `b40fc7e`, `40da017`)
+
+- **2026-06-19** - Removed 18 dormant **snowshoe IPs** from Postfix + host.
+  Built a **mail-reputation monitor** (SNDS-equivalent from Postfix logs) +
+  nightly snapshot cron. Stood up **DMARC aggregate-report ingestion** (dedicated
+  `dmarc@` mailbox + parser); classified the whole `207.174.124.0/24` as ours.
+  (`9dd6f53`, `08f651d`, `b45332b`, `8e5590b`, `707d538`)
+
+- **2026-06-20** - Bounded the untagged (NULL `mx_provider`) bucket in the
+  selector and closed **MX-exclusion gaps** (consumer MX operators were leaking
+  into cold sends); added an MX-tagging cron. (`9eeed47`, `bc93d93`)
+
+- **2026-06-21** - Fixed the **Reply-To header shape** - Listmonk was silently
+  dropping a malformed Reply-To. (`e414ec4`)
+
+- **2026-06-22** - ⭐ **Post-DKIM re-send** to the list, with a **Gmail-only
+  exclusion** (Gmail still distrusted the warming domain). Stepped the trucking
+  rate cap back up to 400/h (day 19–20), 500/h ceiling. (`5a3063e`, `1e9dcfc`)
+
+- **2026-06-22/23** - Fixed broken CTAs in trucking email: a recurring
+  `@TrackLink` **404** + link-collapse bug, and order CTAs pointing at the wrong
+  ($399 catch-all) service page. (`3325259`, `e3f4392`, `a90cdc9`)
+
+- **2026-06-24** - ⭐ **Sending-IP outage.** The warmed sending IPs **dropped
+  off interface `ens18` on reboot**, so mail stopped/misrouted. Fixed to persist
+  across reboots. Also repaired two dead mail-alert crons + de-noised the DMARC
+  digest. (`4276ada`, `ae68edb`)
+
+- **2026-06-26** - ⭐ **Volume whipsaw fixed.** The catch-all guardrail used a
+  **2-day** bounce window; one bad batch (Jun 24: 465 sent / 10.75%) flipped
+  catch-all OFF, starving volume so badly it couldn't gather a 300-send sample to
+  re-enable - a self-reinforcing trap. Widened the window **2d → 5d**. Also fixed
+  the HC cron **re-mailing the whole list daily** (added per-day send lists).
+  (`f344287`, `b350a13`)
+
+- **2026-06-26** - ⭐⭐ **The re-blocklist bomb.** Discovered `listmonk-bounce-sync`
+  (root cron, every 5 min) was blocklisting carriers on the **first hard bounce
+  of *any* 5xx DSN** via direct SQL - bypassing Listmonk's own threshold. *This*
+  is the mechanism that wrongly killed ~17,000 carriers in May.
+  Rewrote it: only
+  genuine bad-mailbox DSNs (5.1.1/5.1.10/5.1.0/5.0.0/5.4.1/5.5.0) count, and it
+  now requires **≥3 distinct hard bounces**. Reputation/policy 5.7.x and
+  quota/greylist 5.2.x never trigger a blocklist. (`bfdbf8f`)
+
+- **2026-06-27** - ⭐ **Wrongly-blocklisted recovery send (campaign 727).**
+  Un-blocklisted 4,317 false-positive carriers (excluding the ~688 real dead
+  mailboxes), re-sent with a fresh 30%-off coupon. Verified the bounce-sync fix
+  held live: 727 took ~61 hard bounces but **0 carriers re-blocklisted**.
+
+- **2026-06-27** - ⭐⭐⭐ **Disk-full Postgres crash, mid-send.** `/` hit **100%**
+  (orphaned 15GB forgejo backup dump + uncapped Docker logs), Postgres
+  crash-looped on "No space left on device", and the Listmonk container was
+  destroyed mid-campaign. Recovered (pruned build cache + dumps + orphan volumes:
+  100% → 72%, 62GB free), recreated Listmonk, campaigns auto-resumed. Added a
+  **Docker log cap** (50m×3) and a **disk-space monitor** (Telegram warn at 90%,
+  auto-reclaim at 94%) - neither existed before. (`e318f12`, `6b2cf5a`)
+
+- **2026-06-27** - ⭐ **/24 RBL listing.** The whole `207.174.124.0/24` block
+  got listed on **invaluement** (ivmSIP + ivmSIP/24) - affects ~11% of
+  recipients (Intermedia/securence business domains); **Spamhaus / Barracuda /
+  SpamCop all clean**, so Gmail/Microsoft/Yahoo unaffected. Dialed catch-all back
+  to smtp_valid-only and submitted a delist request (propagation pending). Also
+  noted ~73 "very low reputation" rejects are **Google-Workspace custom domains**
+  the `@gmail.com` filter misses.
+
+---
+
+## Cross-cutting root causes (the recurring themes)
+
+1. **DKIM not signing (May)** → DMARC rejections misread as hard bounces →
+   mass false-positive blocklisting. The single most damaging issue.
+2. **Over-aggressive blocklisting logic** (both Listmonk's count:1 default *and*
+   the bounce-sync's own SQL) turned transient/policy bounces into permanent
+   list death.
+3. **Reputation/warmup fragility** - snowshoe IPs, missing dedicated subdomain,
+   consumer-domain leakage, big-MX (Google/MS) sensitivity during warmup.
+4. **Operational guardrails too twitchy or missing** - 2-day catch-all window
+   whipsaw; no disk monitoring; uncapped Docker logs; IPs dropping off NIC on
+   reboot; dead alert crons.
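
The blocklisting rule from the 2026-06-26 bounce-sync rewrite (root cause 2) can be sketched roughly as below. This is a minimal illustration, not the actual `listmonk-bounce-sync` code: the `Bounce` type, its field names, and the choice to treat bounces as "distinct" when their message IDs differ are all assumptions made for the example. Only the DSN list and the threshold of 3 come from the timeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable

# Enhanced status codes the timeline names as genuine bad-mailbox signals.
BAD_MAILBOX_DSNS = frozenset(
    {"5.1.1", "5.1.10", "5.1.0", "5.0.0", "5.4.1", "5.5.0"}
)
# Threshold from the 2026-06-26 rewrite: at least 3 distinct hard bounces.
MIN_DISTINCT_HARD_BOUNCES = 3


@dataclass(frozen=True)
class Bounce:
    """One parsed bounce. Field names are hypothetical, not Listmonk's schema."""
    subscriber_id: int
    dsn: str         # enhanced status code, e.g. "5.7.1"
    message_id: str  # assumption: bounces are distinct when this differs


def counts_toward_blocklist(dsn: str) -> bool:
    """True only for bad-mailbox DSNs; policy 5.7.x and quota 5.2.x never count."""
    return dsn in BAD_MAILBOX_DSNS


def subscribers_to_blocklist(bounces: Iterable[Bounce]) -> set[int]:
    """Return subscriber IDs with enough distinct bad-mailbox bounces."""
    distinct: dict[int, set[str]] = defaultdict(set)
    for bounce in bounces:
        if counts_toward_blocklist(bounce.dsn):
            distinct[bounce.subscriber_id].add(bounce.message_id)
    return {
        sid
        for sid, message_ids in distinct.items()
        if len(message_ids) >= MIN_DISTINCT_HARD_BOUNCES
    }
```

Under this sketch, a carrier that collects any number of `5.7.1` DMARC rejections, as in the 2026-05-30 blast, is never returned by `subscribers_to_blocklist`, and a single `5.1.1` bounce reported several times for the same message counts once.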