diff --git a/docs/deliverability.md b/docs/deliverability.md
index f577894..81e269e 100644
--- a/docs/deliverability.md
+++ b/docs/deliverability.md
@@ -2,45 +2,130 @@

 **Owner action items are marked 🔴 MANUAL. Everything else is already done/automated.**

-Last updated: 2026-06-18 (IP consolidation + monitoring-tools setup).
+Last updated: 2026-06-19 (bulk subdomain + SPF trim + Microsoft/audience analysis).

 ---

-## TL;DR of the 2026-06-18 deliverability incident
+## TL;DR of the 2026-06-18/19 deliverability incident

 - **Symptom:** ~30% "open" rates but **0 human clicks, 0 sales** across both
   trucking and healthcare streams.
-- **Root cause:** NOT a blocklist. Swept all 21 sending IPs against ~40 RBLs
-  (Spamhaus via authoritative NS, Barracuda, SpamCop, SORBS, UCEPROTECT L1/2/3,
-  Mailspike, SpamRATS, etc.) -> **every IP clean.** The real problem was
-  **domain reputation**: Gmail rejected ~150 msgs/day with
-  `550-5.7.1 ... very low reputation of the sending domain`. We were
-  **snowshoeing** ~3k trucking msgs/day across 12 IPs + ~1.2k healthcare across
-  3 IPs, so no single IP sent enough per-receiver volume to build reputation.
-  This rotation was a band-aid for the **broken DKIM** (fixed 2026-06-17) and the
-  May 30-31 over-volume blast.
-- **Fix applied:** consolidated to ONE IP per stream (below) so each accrues real
-  reputation now that DKIM signs correctly.
+- **Root cause:** NOT a blocklist, NOT the IPs. Proven by a controlled A/B test
+  (2026-06-19): from the **same mail server / same IPs**, a message From
+  `justin@carrierone.com` landed in the **Inbox** while From
+  `justin@performancewest.net` went to **Junk**. The variable is the **From
+  domain's reputation**. `carrierone.com` (reg. 2006, years of steady low-volume
+  mail, tight 2-IP SPF) is trusted; `performancewest.net` (only started bulk in
+  ~May 2026, broken DKIM until 2026-06-17, 21-IP snowshoe SPF, May 30-31
+  over-volume blast) is cold/damaged.
+- **Where the audience actually is (24h receiver mix):** **~85% Microsoft**
+  (M365/Outlook/Hotmail), ~14% Google, <1% Yahoo. Our list is B2B, so Microsoft
+  is the game, not Gmail. **Microsoft is NOT reputation-blocking us** (only ~1.6%
+  5.7.x/S3150 rejects; it accepts ~2,138 msgs/24h) — but acceptance != inbox, so
+  the engagement problem there is likely Junk-foldering, same domain-reputation
+  cause. Gmail rejects ~95% of its (smaller) slice on `550-5.7.1 ... very low
+  reputation of the sending domain`. The single biggest bounce bucket is actually
+  **list hygiene**: ~1,012/24h Microsoft `451 4.4.4 no mail-enabled subscriptions`
+  (dead tenant domains) + dead recipients.
+- **Fixes applied (2026-06-18/19):**
+  1. Consolidated to ONE IP per stream (snowshoe was a band-aid for broken DKIM).
+  2. **Dedicated bulk subdomain** `send.performancewest.net` so bulk reputation is
+     isolated from the root domain (which stays clean for transactional mail).
+  3. Trimmed root SPF from 21 IPs to the real 3 (the bloated record was itself a
+     snowshoe signal).
+  4. Disabled the pointless `pw-ip-rehab` cron (we have no IP reputation problem).

 ---

-## Sending architecture (after 2026-06-18 consolidation)
+## Bulk subdomain: send.performancewest.net (2026-06-19)
+
+**Why:** isolate bulk/cold-campaign sending reputation from the root domain. The
+root domain carries transactional/verification/receipt mail (via co.carrierone.com
+relay + the .71 default egress) and must stay clean; cold campaigns are inherently
+reputation-risky. Industry-standard (SendGrid/Mailchimp/etc.) split.
+
+**Customer experience is unchanged:** From is the subdomain, but **Reply-To stays
+`info@performancewest.net`**, so replies land in the real inbox and look normal.
+
+| Piece | Value |
+|-------|-------|
+| Trucking From | `Performance West ` |
+| Healthcare From | `Performance West Compliance ` |
+| Reply-To (both) | `info@performancewest.net` |
+| DKIM selector | `send` (`send._domainkey.send.performancewest.net`), 2048-bit |
+| SPF | `v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all` |
+| DMARC | inherits root `p=reject` (explicit `_dmarc.send` also published) |
+| MX / Return-Path | `co.carrierone.com` (bounces) |
+| Egress IPs | .94 (trucking) / .107 (HC) — unchanged |
+
+**Code:** `from_email` is set in `scripts/build_trucking_campaigns.py` (`FROM_EMAIL`,
+env `CAMPAIGN_FROM`) and `scripts/build_healthcare_campaigns_cron.py` (`FROM_EMAIL`,
+env `HC_CAMPAIGN_FROM`). Bounce-watchers (`scripts/bounce-watcher.sh`,
+`scripts/hc-bounce-watcher.sh`) track the new subdomain sender (and keep the legacy
+root sender so the pre-cutover queue drains).
+
+**Infra:** OpenDKIM signs both domains — see `infra/ansible/roles/mail`
+(`opendkim_signing_domains` list generates per-domain keys + KeyTable/SigningTable).
+DNS published on the Hestia master (see DNS automation note below). Verified
+end-to-end 2026-06-19: a test send signs `d=send.performancewest.net; s=send;` and
+egresses out05/.94.
+
+**Listmonk global `app.from_email`** was also updated in both DBs as a fallback for
+any UI/test send that doesn't set From explicitly.
+
+> ⚠️ The subdomain starts at NEUTRAL reputation (not negative, not warm). It still
+> needs the same warm-up discipline: steady low volume to engaged recipients. It is
+> NOT a magic reset — but it protects the root domain and starts cleaner than the
+> damaged root.
+
+---
+
+## Sending architecture (after 2026-06-18/19 consolidation)

 | Stream | IP | PTR / HELO | Path |
 |--------|----|-----------|----|
 | **Trucking** (listmonk) | **207.174.124.94** | mta05.performancewest.net | listmonk -> :25 -> `randmap:{out05:}` |
 | **Healthcare** (listmonk-hc) | **207.174.124.107** | hcmta01.performancewest.net | listmonk-hc SMTP server 1 -> :2526 -> hcout1 |
+| Transactional / verification | 207.174.124.71 + co.carrierone.com (.15) | perfwest | default `smtp_bind_address` (.71) + :587 relay (.15) |
 | Yahoo/AOL trickle | 207.174.124.90 | mta01 | `yahooslow` transport (hash:transport) |
-| Transactional | 207.174.124.71 | perfwest | default `smtp_bind_address` |
-| Retired (torched May 30-31) | .91 / .92 / .93 | mta02-04 | rehab02-04 (reputation rebuild only) |
+| Retired (torched May 30-31) | .91 / .92 / .93 | mta02-04 | rehab02-04 — **`pw-ip-rehab` cron DISABLED 2026-06-19** |
 | Dormant (re-expand later) | .95-.105, .108-.109 | mta06-17, hcmta02-03 | disabled |

+**Root SPF (trimmed 2026-06-19):** `v=spf1 a mx ip4:207.174.124.15
+ip4:207.174.124.94 ip4:207.174.124.107 -all` — `a`=.71, `mx`=co.carrierone.com(.15),
+plus the two bulk IPs. The old 21-IP record was a snowshoe signal; this matches
+carrierone.com's tight style.
+
 **To re-expand after reputation is established:** add transports back to `ALL=()`
 in `infra/postfix/pw-mta-warmup.sh` and re-enable the HC SMTP servers (ports
 2527/2528) in the `listmonk_hc` DB `settings.smtp`. Re-expand SLOWLY (one IP at a
 time, days apart) and only after Postmaster Tools shows a green/medium reputation.
+If you re-expand, also add the IPs back to BOTH the root SPF and the `send`
+subdomain SPF.

-SPF authorizes the whole `.71/.90-.109` set already — harmless, gives flexibility.
+---
+
+## DNS automation (Hestia is the master)
+
+**DNS is fully automatable** — Hestia (`cp.carrierone.com`, 207.174.124.22) is the
+DNS master; HE.net are slaves. Access: `ssh -p 22022 root@cp.carrierone.com` using
+the **local workstation's** `~/.ssh/id_ed25519` (NOT the app server, NOT justin@
+which is SFTP-only). The `justin` Hestia user owns the `performancewest.net` zone.
+
+```
+# add (note: Hestia appends the base domain to the RECORD name, so a record at
+# send._domainkey.send.performancewest.net needs RECORD = "send._domainkey.send")
+v-add-dns-record justin performancewest.net "" "" [prio]
+# change / delete (find the numeric id with v-list-dns-records ... plain)
+v-change-dns-record justin performancewest.net "" "" "" yes
+v-delete-dns-record justin performancewest.net
+# list
+v-list-dns-records justin performancewest.net plain
+```
+
+Each write triggers a ~30s zone rebuild + DNSSEC re-sign; slaves sync via NOTIFY /
+SOA refresh, usually within a minute. Verify on `@8.8.8.8` AND the master
+`@207.174.124.22` (the master is authoritative; public resolvers may lag).

 ---

diff --git a/infra/ansible/inventory/group_vars/all.yml b/infra/ansible/inventory/group_vars/all.yml
index 174e465..1a958c9 100644
--- a/infra/ansible/inventory/group_vars/all.yml
+++ b/infra/ansible/inventory/group_vars/all.yml
@@ -80,6 +80,21 @@ smtp_pass: "{{ vault_smtp_pass }}"
 smtp_from: "Performance West "
 smtp_admin_email: ops@performancewest.net

+# ── Bulk campaign From (Listmonk) ────────────────────────────────────────────
+# Cold/bulk campaign mail is sent From a dedicated bulk subdomain so its sending
+# reputation is ISOLATED from the root domain. The root domain (smtp_from above)
+# carries transactional/verification/receipt mail and stays clean. Replies still
+# route to the root domain via Reply-To, so the customer reply experience is
+# unchanged. These map to the CAMPAIGN_FROM / HC_CAMPAIGN_FROM env vars read by
+# scripts/build_trucking_campaigns.py and build_healthcare_campaigns_cron.py.
+# See docs/deliverability.md. The subdomain's DNS (A/MX/SPF/DKIM selector=send/
+# DMARC) is published on the Hestia DNS master; OpenDKIM signs it (see role mail,
+# opendkim_signing_domains).
+bulk_mail_subdomain: send.performancewest.net
+campaign_from_trucking: "Performance West "
+campaign_from_healthcare: "Performance West Compliance "
+campaign_reply_to: info@performancewest.net
+
 # ── Listmonk (mass-mail via the LOCAL MTA) ───────────────────────────────────
 # Listmonk SMTP is configured via its web admin UI, not env vars. Listmonk relays
 # through the host Postfix (172.18.0.1:25 from inside the Docker network), which
diff --git a/infra/ansible/roles/mail/defaults/main.yml b/infra/ansible/roles/mail/defaults/main.yml
index e027fe6..1d6cb75 100644
--- a/infra/ansible/roles/mail/defaults/main.yml
+++ b/infra/ansible/roles/mail/defaults/main.yml
@@ -13,6 +13,19 @@ opendkim_selector: mail
 opendkim_signing_domain: performancewest.net
 opendkim_socket: "inet:8891@localhost"

+# Signing domains. The root domain carries transactional/verification mail; the
+# dedicated bulk subdomain (send.performancewest.net) carries Listmonk campaign
+# mail so its sending reputation is isolated from the root domain (which then
+# stays clean and recovers faster). Each entry generates its own key + selector
+# and contributes a line to KeyTable/SigningTable. The first entry is treated as
+# the primary (kept for backwards-compat with opendkim_signing_domain above).
+# See docs/deliverability.md.
+opendkim_signing_domains:
+  - domain: "{{ opendkim_signing_domain }}"
+    selector: "{{ opendkim_selector }}"
+  - domain: "send.performancewest.net"
+    selector: "send"
+
 # Hosts OpenDKIM will SIGN for (vs verify). Must include the Docker bridge
 # subnet so Listmonk container traffic is signed.
 opendkim_internal_hosts:
diff --git a/infra/ansible/roles/mail/tasks/main.yml b/infra/ansible/roles/mail/tasks/main.yml
index 252fa85..88862e1 100644
--- a/infra/ansible/roles/mail/tasks/main.yml
+++ b/infra/ansible/roles/mail/tasks/main.yml
@@ -8,43 +8,57 @@

 - name: Ensure OpenDKIM key directory exists
   ansible.builtin.file:
-    path: "/etc/opendkim/keys/{{ opendkim_signing_domain }}"
+    path: "/etc/opendkim/keys/{{ item.domain }}"
     state: directory
     owner: opendkim
     group: opendkim
     mode: "0750"
+  loop: "{{ opendkim_signing_domains }}"
+  loop_control:
+    label: "{{ item.domain }}"

 - name: Generate DKIM keypair if missing
   ansible.builtin.command:
     cmd: >-
       opendkim-genkey
       -b 2048
-      -d {{ opendkim_signing_domain }}
-      -s {{ opendkim_selector }}
-      -D /etc/opendkim/keys/{{ opendkim_signing_domain }}
-    creates: "/etc/opendkim/keys/{{ opendkim_signing_domain }}/{{ opendkim_selector }}.private"
+      -d {{ item.domain }}
+      -s {{ item.selector }}
+      -D /etc/opendkim/keys/{{ item.domain }}
+    creates: "/etc/opendkim/keys/{{ item.domain }}/{{ item.selector }}.private"
+  loop: "{{ opendkim_signing_domains }}"
+  loop_control:
+    label: "{{ item.domain }} ({{ item.selector }})"
   register: dkim_keygen

 - name: Fix DKIM private key ownership
   ansible.builtin.file:
-    path: "/etc/opendkim/keys/{{ opendkim_signing_domain }}/{{ opendkim_selector }}.private"
+    path: "/etc/opendkim/keys/{{ item.domain }}/{{ item.selector }}.private"
     owner: opendkim
     group: opendkim
     mode: "0600"
+  loop: "{{ opendkim_signing_domains }}"
+  loop_control:
+    label: "{{ item.domain }}"

-- name: Show DKIM public DNS record to publish (only when newly generated)
+- name: Show DKIM public DNS records to publish (only when newly generated)
   ansible.builtin.debug:
     msg: >-
       A new DKIM key was generated. Publish the TXT record from
-      /etc/opendkim/keys/{{ opendkim_signing_domain }}/{{ opendkim_selector }}.txt
-      at {{ opendkim_selector }}._domainkey.{{ opendkim_signing_domain }}
-  when: dkim_keygen is changed
+      /etc/opendkim/keys/{{ item.item.domain }}/{{ item.item.selector }}.txt
+      at {{ item.item.selector }}._domainkey.{{ item.item.domain }}
+  loop: "{{ dkim_keygen.results }}"
+  loop_control:
+    label: "{{ item.item.domain }}"
+  when: item is changed

 - name: Deploy OpenDKIM KeyTable
   ansible.builtin.copy:
     dest: /etc/opendkim/key.table
     content: |
-      {{ opendkim_selector }}._domainkey.{{ opendkim_signing_domain }} {{ opendkim_signing_domain }}:{{ opendkim_selector }}:/etc/opendkim/keys/{{ opendkim_signing_domain }}/{{ opendkim_selector }}.private
+      {% for d in opendkim_signing_domains %}
+      {{ d.selector }}._domainkey.{{ d.domain }} {{ d.domain }}:{{ d.selector }}:/etc/opendkim/keys/{{ d.domain }}/{{ d.selector }}.private
+      {% endfor %}
     owner: root
     group: root
     mode: "0644"
@@ -54,7 +68,9 @@
   ansible.builtin.copy:
     dest: /etc/opendkim/signing.table
     content: |
-      *@{{ opendkim_signing_domain }} {{ opendkim_selector }}._domainkey.{{ opendkim_signing_domain }}
+      {% for d in opendkim_signing_domains %}
+      *@{{ d.domain }} {{ d.selector }}._domainkey.{{ d.domain }}
+      {% endfor %}
     owner: root
     group: root
     mode: "0644"
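
Review aid: a minimal Python sketch of what the two templated loops in `tasks/main.yml` would be expected to render for the default `opendkim_signing_domains`. It assumes the role defaults shown in this diff (`mail` selector on the root domain, `send` selector on the bulk subdomain) and uses plain string formatting in place of Jinja. The helper names `render_key_table` and `render_signing_table` are hypothetical and not part of the role.

```python
from typing import Dict, List

# Assumption: mirrors the role defaults in this diff. The first entry is the
# legacy single-domain pair, the second is the new bulk subdomain.
SIGNING_DOMAINS: List[Dict[str, str]] = [
    {"domain": "performancewest.net", "selector": "mail"},
    {"domain": "send.performancewest.net", "selector": "send"},
]

KEY_DIR = "/etc/opendkim/keys"


def render_key_table(domains: List[Dict[str, str]]) -> List[str]:
    """One KeyTable line per domain: <key name> <domain>:<selector>:<key path>."""
    lines = []
    for d in domains:
        key_name = f"{d['selector']}._domainkey.{d['domain']}"
        key_path = f"{KEY_DIR}/{d['domain']}/{d['selector']}.private"
        lines.append(f"{key_name} {d['domain']}:{d['selector']}:{key_path}")
    return lines


def render_signing_table(domains: List[Dict[str, str]]) -> List[str]:
    """One SigningTable line per domain: <sender pattern> <key name>."""
    return [
        f"*@{d['domain']} {d['selector']}._domainkey.{d['domain']}"
        for d in domains
    ]


if __name__ == "__main__":
    print("# key.table")
    print("\n".join(render_key_table(SIGNING_DOMAINS)))
    print("# signing.table")
    print("\n".join(render_signing_table(SIGNING_DOMAINS)))
```

Comparing this output with the deployed `/etc/opendkim/key.table` and `/etc/opendkim/signing.table` after a playbook run is one way to confirm that both domains produced an entry.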