docs: email deliverability incident timeline (May-June 2026)
parent 6b2cf5a07b
commit f163ccea92
1 changed file with 131 additions and 0 deletions

docs/email-deliverability-incident-timeline.md (normal file, +131)

@@ -0,0 +1,131 @@

# Email Deliverability - Incident & Issue Timeline (May–June 2026)

Listmonk + Postfix (self-hosted MTA) cold-outreach for trucking (IP .94) and
healthcare (IP .107) on host 207.174.124.71. Dates are when the issue was
identified/fixed; root causes often predate the fix.

---
## May

- **2026-05-21** - Rebuilt the Listmonk bounce-sync after unreliable webhook
  delivery (Listmonk silently drops bounces it can't FK-match to a subscriber).
  Switched to log-scraping `/var/log/mail.log` and inserting with real
  subscriber IDs. (commit `ba2f6eb`)

- **2026-05-30** - ⭐ **The DKIM disaster blast.** A large trucking blast went
  out with **broken DKIM signing**, so receivers rejected it under DMARC
  policy. **7,634 "hard bounces" in one day** - but ~6,604 were DSN `5.7.1`
  (DMARC/policy failures, *not* bad mailboxes); only ~221 were real dead
  mailboxes (`5.1.1`). This is the event that poisoned the carrier list.

- **2026-05-31** → following weeks - Fallout: Listmonk auto-blocklisted on the
  **first** hard bounce, and the bounce-sync's own SQL also blocklisted on the
  first 5xx bounce, whatever the DSN. Result: **~17,000 carriers wrongly
  blocklisted** (88% of the list) over the broken-DKIM window. These were not
  recognized as false positives until late June.

---
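
The May entries hinge on reading DSN codes out of `/var/log/mail.log` and on telling policy rejections (`5.7.1`) apart from dead mailboxes (`5.1.1`). The following is a minimal sketch of that tally, not the actual bounce-sync code: it assumes the usual Postfix delivery line shape (`to=<...>, ... dsn=X.Y.Z, status=...`), and the function name, regex, and sample lines are all hypothetical. The last two lines only restate the timeline's approximate 2026-05-30 figures as shares of the day's bounces.

```python
import re
from collections import Counter

# Pulls recipient, DSN code and status out of a Postfix delivery log line.
DELIVERY_RE = re.compile(
    r"\bto=<(?P<rcpt>[^>]+)>.*?\bdsn=(?P<dsn>\d\.\d{1,3}\.\d{1,3}),"
    r"\s*status=(?P<status>\w+)"
)


def tally_bounces(lines):
    """Count bounced deliveries per DSN code; other statuses are ignored."""
    counts = Counter()
    for line in lines:
        match = DELIVERY_RE.search(line)
        if match and match.group("status") == "bounced":
            counts[match.group("dsn")] += 1
    return counts


# Hypothetical log lines (addresses, queue IDs and hosts are made up).
SAMPLE_LINES = [
    "May 30 10:00:01 mx postfix/smtp[1234]: 4A1B2C3D: to=<a@example.com>, "
    "relay=mx.example.com[192.0.2.1]:25, delay=1.2, dsn=5.7.1, status=bounced "
    "(host mx.example.com said: 550 5.7.1 rejected by DMARC policy)",
    "May 30 10:00:02 mx postfix/smtp[1234]: 4A1B2C3E: to=<b@example.org>, "
    "relay=mx.example.org[192.0.2.2]:25, delay=0.9, dsn=5.7.1, status=bounced "
    "(host mx.example.org said: 550 5.7.1 unauthenticated mail)",
    "May 30 10:00:03 mx postfix/smtp[1234]: 4A1B2C3F: to=<c@example.net>, "
    "relay=mx.example.net[192.0.2.3]:25, delay=0.7, dsn=5.1.1, status=bounced "
    "(host mx.example.net said: 550 5.1.1 user unknown)",
    "May 30 10:00:04 mx postfix/smtp[1234]: 4A1B2C40: to=<d@example.com>, "
    "relay=mx.example.com[192.0.2.1]:25, delay=0.5, dsn=2.0.0, status=sent "
    "(250 2.0.0 OK)",
]

tally = tally_bounces(SAMPLE_LINES)
print(tally["5.7.1"], tally["5.1.1"])

# Shares implied by the timeline's approximate 2026-05-30 figures.
print(round(100 * 6604 / 7634, 1))  # policy failures (5.7.1), in percent
print(round(100 * 221 / 7634, 1))   # real dead mailboxes (5.1.1), in percent
```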
## June - root-cause fixes to the sending stack

- **2026-06-14** - Per-MX-operator throttling added; Google Workspace /
  Microsoft 365 excluded from warmup sends. HC (healthcare) warmup corrected to
  run **daily** for the full 21-day ramp (was weekdays-only, stretching the
  ramp). (`9e40965`, `2caab6a`)

- **2026-06-16** - Stopped blasting trucking to `mx_unreachable` dead domains;
  the verifier was mislabeling live big-ISP mailboxes as unreachable. Suppressed
  defunct/legacy/satellite ISP domains in cold sends. (`1652a3b`, `1eb29f8`,
  `c183957`)

- **2026-06-17** - ⭐ **Root DKIM fix.** Found OpenDKIM was **not signing**
  campaign mail (the Docker-injected path bypassed signing); fixed and codified
  in Ansible. Also: added a `text/plain` MIME part to every email (spam-filter
  requirement), set a stable Message-ID hostname, added logrotate for Postfix
  `mail.log`, and decommissioned SMTP2GO (local MTA only). (`4d59019`,
  `a32a3b0`, `b375385`, `2e4388a`, `a04ecf7`)

- **2026-06-18** - ⭐ Moved bulk campaigns to a **dedicated subdomain**
  `send.performancewest.net` (protects the root domain's reputation); signing
  for it is codified in Ansible. **Killed the snowshoe IP pattern** now that
  DKIM works (consolidated sending IPs). Excluded Apple/iCloud consumer mail;
  began scrubbing stale consumer subscribers from Listmonk. Catch-all pool
  auto-rollout gated by warmup-day + live bounce rate. (`5c3b429`, `545e6f7`,
  `b40fc7e`, `40da017`)

- **2026-06-19** - Removed 18 dormant **snowshoe IPs** from Postfix + host.
  Built a **mail-reputation monitor** (SNDS-equivalent from Postfix logs) +
  nightly snapshot cron. Stood up **DMARC aggregate-report ingestion** (dedicated
  `dmarc@` mailbox + parser); classified the whole `207.174.124.0/24` as ours.
  (`9dd6f53`, `08f651d`, `b45332b`, `8e5590b`, `707d538`)

- **2026-06-20** - Bounded the untagged (NULL `mx_provider`) bucket in the
  selector and closed **MX-exclusion gaps** (consumer MX operators were leaking
  into cold sends); added an MX-tagging cron. (`9eeed47`, `bc93d93`)

- **2026-06-21** - Fixed the **Reply-To header shape** - Listmonk was silently
  dropping a malformed Reply-To. (`e414ec4`)

- **2026-06-22** - ⭐ **Post-DKIM re-send** to the list, with a **Gmail-only
  exclusion** (Gmail still distrusted the warming domain). Stepped the trucking
  rate cap back up to 400/h (day 19–20), 500/h ceiling. (`5a3063e`, `1e9dcfc`)

- **2026-06-22/23** - Fixed broken CTAs in trucking email: a recurring
  `@TrackLink` **404** + link-collapse bug, and order CTAs pointing at the wrong
  ($399 catch-all) service page. (`3325259`, `e3f4392`, `a90cdc9`)

- **2026-06-24** - ⭐ **Sending-IP outage.** The warmed sending IPs **dropped
  off interface `ens18` on reboot**, so mail stopped/misrouted. Fixed so the IP
  assignments persist across reboots. Also repaired two dead mail-alert crons +
  de-noised the DMARC digest. (`4276ada`, `ae68edb`)

- **2026-06-26** - ⭐ **Volume whipsaw fixed.** The catch-all guardrail used a
  **2-day** bounce window; one bad batch (Jun 24: 465 sent / 10.75% bounced)
  flipped catch-all OFF, starving volume so badly it couldn't gather a 300-send
  sample to re-enable - a self-reinforcing trap. Widened the window **2d → 5d**.
  Also fixed the HC cron **re-mailing the whole list daily** (added per-day send
  lists). (`f344287`, `b350a13`)

- **2026-06-26** - ⭐⭐ **The re-blocklist bomb.** Discovered `listmonk-bounce-sync`
  (root cron, every 5 min) was blocklisting carriers on the **first hard bounce
  of *any* 5xx DSN** via direct SQL - bypassing Listmonk's own threshold. *This*
  is the mechanism that wrongly killed ~17,000 carriers in May. Rewrote it: only
  genuine bad-mailbox DSNs (5.1.1/5.1.10/5.1.0/5.0.0/5.4.1/5.5.0) count, and it
  now requires **≥3 distinct hard bounces**. Reputation/policy 5.7.x and
  quota/greylist 5.2.x never trigger a blocklist. (`bfdbf8f`)

- **2026-06-27** - ⭐ **Wrongly-blocklisted recovery send (campaign 727).**
  Un-blocklisted 4,317 false-positive carriers (excluding the ~688 real dead
  mailboxes), re-sent with a fresh 30%-off coupon. Verified the bounce-sync fix
  held live: campaign 727 took ~61 hard bounces but **0 carriers
  re-blocklisted**.

- **2026-06-27** - ⭐⭐⭐ **Disk-full Postgres crash, mid-send.** `/` hit **100%**
  (orphaned 15GB forgejo backup dump + uncapped Docker logs), Postgres
  crash-looped on "No space left on device", and the Listmonk container was
  destroyed mid-campaign. Recovered (pruned build cache + dumps + orphan volumes:
  100% → 72%, 62GB free), recreated Listmonk, campaigns auto-resumed. Added a
  **Docker log cap** (50m×3) and a **disk-space monitor** (Telegram warn at 90%,
  auto-reclaim at 94%) - neither existed before. (`e318f12`, `6b2cf5a`)

- **2026-06-27** - ⭐ **/24 RBL listing.** The whole `207.174.124.0/24` block
  got listed on **invaluement** (ivmSIP + ivmSIP/24) - affects ~11% of
  recipients (Intermedia/securence business domains); **Spamhaus / Barracuda /
  SpamCop all clean**, so Gmail/Microsoft/Yahoo unaffected. Dialed catch-all back
  to smtp_valid-only and submitted a delist request (propagation pending). Also
  noted that ~73 "very low reputation" rejects came from **Google-Workspace
  custom domains**, which the `@gmail.com` filter misses.

---
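
The rule introduced by the 2026-06-26 bounce-sync rewrite can be sketched as a small pure function. This is an illustration under stated assumptions rather than the code in `bfdbf8f`: the DSN set and the threshold of 3 come from the timeline, while the function name, the tuple layout, and the reading of "distinct" as "distinct messages" are hypothetical. Policy (`5.7.x`) and quota (`5.2.x`) codes are simply absent from the allowlist, so they can never count toward a blocklist.

```python
from collections import defaultdict

# DSN codes the timeline lists as genuine bad-mailbox signals.
BAD_MAILBOX_DSNS = {"5.1.1", "5.1.10", "5.1.0", "5.0.0", "5.4.1", "5.5.0"}
# Threshold from the timeline: at least 3 distinct hard bounces.
MIN_DISTINCT_BOUNCES = 3


def carriers_to_blocklist(bounces):
    """Return subscriber IDs that meet the blocklist rule.

    bounces: iterable of (subscriber_id, message_id, dsn) tuples.
    Assumption: "distinct" means distinct message IDs, so a bounce that is
    logged twice for the same message counts once.
    """
    qualifying = defaultdict(set)
    for subscriber_id, message_id, dsn in bounces:
        if dsn in BAD_MAILBOX_DSNS:
            qualifying[subscriber_id].add(message_id)
    return {
        subscriber_id
        for subscriber_id, messages in qualifying.items()
        if len(messages) >= MIN_DISTINCT_BOUNCES
    }


# Hypothetical bounce events for four subscribers.
EVENTS = [
    (1, "m1", "5.7.1"), (1, "m2", "5.7.1"), (1, "m3", "5.7.1"),  # policy only
    (2, "m1", "5.1.1"), (2, "m2", "5.1.1"), (2, "m3", "5.1.0"),  # 3 distinct
    (3, "m1", "5.1.1"), (3, "m1", "5.1.1"), (3, "m1", "5.1.1"),  # 1 distinct
    (4, "m1", "5.2.2"), (4, "m2", "5.1.1"),                      # 1 qualifying
]

print(carriers_to_blocklist(EVENTS))
```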
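
The catch-all whipsaw from the 2026-06-26 entry is easier to see with numbers. The sketch below assumes a guardrail that sums sends and bounces over a trailing window, leaves the state unchanged when the window holds fewer than 300 sends, and otherwise compares the bounce rate to a limit. The 300-send sample and the Jun 24 batch (465 sent, about 50 bounced at 10.75%) come from the timeline; the 5% limit, the function name, and every other day's volume are hypothetical.

```python
def catch_all_state(daily, window_days, current, min_sample=300, max_rate=0.05):
    """Decide whether catch-all sending should be "on" or "off".

    daily: list of (sent, bounced) per day, oldest first.
    current: the state to keep when the window has too few sends to judge.
    max_rate is a hypothetical limit; the timeline does not state one.
    """
    window = daily[-window_days:]
    sent = sum(s for s, _ in window)
    bounced = sum(b for _, b in window)
    if sent < min_sample:
        return current  # not enough data, so the previous state is frozen
    return "off" if bounced / sent > max_rate else "on"


# Hypothetical volumes for Jun 22-26; only the Jun 24 batch is from the timeline.
HISTORY = [(900, 18), (800, 16), (465, 50), (40, 1), (30, 0)]

# On Jun 24 the 2-day window sees 66 bounces in 1,265 sends and flips OFF.
print(catch_all_state(HISTORY[:3], window_days=2, current="on"))
# By Jun 26 the 2-day window holds only 70 sends, so OFF can never be lifted.
print(catch_all_state(HISTORY, window_days=2, current="off"))
# A 5-day window still has enough volume to judge and re-enables.
print(catch_all_state(HISTORY, window_days=5, current="off"))
```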
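
The disk-space monitor added after the 2026-06-27 crash uses two thresholds: warn at 90% and auto-reclaim at 94%. A minimal sketch of that decision follows, assuming the check runs against `/`. The function names are hypothetical, and the Telegram alert and the reclaim step themselves are left out because the timeline does not describe how they are implemented.

```python
import shutil

# Thresholds from the timeline: Telegram warning at 90%, auto-reclaim at 94%.
WARN_PCT = 90.0
RECLAIM_PCT = 94.0


def used_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use.

    This is used/total, which can differ slightly from the Use% column of
    `df` because df excludes blocks reserved for root.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_action(pct):
    """Map a usage percentage to "ok", "warn", or "reclaim"."""
    if pct >= RECLAIM_PCT:
        return "reclaim"
    if pct >= WARN_PCT:
        return "warn"
    return "ok"


# The post-recovery level from the timeline (72%) is back in the "ok" band.
print(disk_action(72.0))
print(disk_action(used_percent("/")))
```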
## Cross-cutting root causes (the recurring themes)

1. **DKIM not signing (May)** → DMARC rejections misread as hard bounces →
   mass false-positive blocklisting. The single most damaging issue.
2. **Over-aggressive blocklisting logic** (both Listmonk's count:1 default *and*
   the bounce-sync's own SQL) turned transient/policy bounces into permanent
   list death.
3. **Reputation/warmup fragility** - snowshoe IPs, missing dedicated subdomain,
   consumer-domain leakage, big-MX (Google/MS) sensitivity during warmup.
4. **Operational guardrails too twitchy or missing** - 2-day catch-all window
   whipsaw; no disk monitoring; uncapped Docker logs; IPs dropping off NIC on
   reboot; dead alert crons.