new-site/docs/deliverability.md
justin 3ca960aca5 docs+infra(deliverability): document bulk subdomain; ansible signs send.performancewest.net
- infra/ansible/roles/mail: refactor OpenDKIM to support multiple signing domains
  via opendkim_signing_domains list (root + send.performancewest.net). Loops
  keygen/ownership/keytable/signingtable so the live two-domain setup is
  reproducible from ansible.
- infra/ansible group_vars: add bulk_mail_subdomain + campaign_from_* +
  campaign_reply_to documentation vars (map to CAMPAIGN_FROM / HC_CAMPAIGN_FROM
  env read by the builder scripts). smtp_from (transactional) stays on root.
- docs/deliverability.md: rewrite TL;DR with the carrierone-vs-performancewest
  A/B proof (same server/IPs, different From domain -> Inbox vs Junk) and the
  ~85% Microsoft / 14% Google / <1% Yahoo audience mix; add the bulk-subdomain
  section, SPF trim, rehab-disabled, and the Hestia DNS automation runbook.
2026-06-18 23:12:05 -05:00

Email Deliverability Runbook

Owner action items are marked 🔴 MANUAL. Everything else is already done/automated.

Last updated: 2026-06-19 (bulk subdomain + SPF trim + Microsoft/audience analysis).


TL;DR of the 2026-06-18/19 deliverability incident

  • Symptom: ~30% "open" rates but 0 human clicks, 0 sales across both trucking and healthcare streams.
  • Root cause: NOT a blocklist, NOT the IPs. Proven by a controlled A/B test (2026-06-19): from the same mail server / same IPs, a message From justin@carrierone.com landed in the Inbox while From justin@performancewest.net went to Junk. The variable is the From domain's reputation. carrierone.com (reg. 2006, years of steady low-volume mail, tight 2-IP SPF) is trusted; performancewest.net (only started bulk in ~May 2026, broken DKIM until 2026-06-17, 21-IP snowshoe SPF, May 30-31 over-volume blast) is cold/damaged.
  • Where the audience actually is (24h receiver mix): ~85% Microsoft (M365/Outlook/Hotmail), ~14% Google, <1% Yahoo. Our list is B2B, so Microsoft is the game, not Gmail. Microsoft is NOT reputation-blocking us (only ~1.6% 5.7.x/S3150 rejects; it accepts ~2,138 msgs/24h) — but acceptance != inbox, so the engagement problem there is likely Junk-foldering, same domain-reputation cause. Gmail rejects ~95% of its (smaller) slice on 550-5.7.1 ... very low reputation of the sending domain. The single biggest bounce bucket is actually list hygiene: ~1,012/24h Microsoft 451 4.4.4 no mail-enabled subscriptions (dead tenant domains) + dead recipients.
  • Fixes applied (2026-06-18/19):
    1. Consolidated to ONE IP per stream (snowshoe was a band-aid for broken DKIM).
    2. Dedicated bulk subdomain send.performancewest.net so bulk reputation is isolated from the root domain (which stays clean for transactional mail).
    3. Trimmed root SPF from 21 IPs to the real 3 (the bloated record was itself a snowshoe signal).
    4. Disabled the pointless pw-ip-rehab cron (we have no IP reputation problem).
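
The receiver mix and bounce buckets above can be re-derived from the mail log. The following is a minimal sketch, assuming standard Postfix smtp delivery lines carrying `relay=`, `dsn=` and `status=` fields. The hostname suffixes used to label each receiver and the function names are illustrative assumptions, not part of the existing scripts.

```python
import re
from collections import Counter

# Assumed relay-hostname suffixes per receiver; extend as needed.
RECEIVER_SUFFIXES = {
    "microsoft": (".protection.outlook.com",),
    "google": (".google.com", ".googlemail.com"),
    "yahoo": (".yahoodns.net",),
}

# Pulls the relay host, DSN code and status out of a Postfix smtp delivery line.
LINE_RE = re.compile(
    r"relay=(?P<relay>[^\[,\s]+).*?dsn=(?P<dsn>\d\.\d+\.\d+).*?status=(?P<status>\w+)"
)


def classify_receiver(relay: str) -> str:
    """Map a relay hostname to a receiver label, or 'other' if unknown."""
    relay = relay.lower().rstrip(".")
    for name, suffixes in RECEIVER_SUFFIXES.items():
        if relay.endswith(suffixes):
            return name
    return "other"


def receiver_mix(lines):
    """Count deliveries keyed by (receiver, status, dsn)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (classify_receiver(match["relay"]), match["status"], match["dsn"])
            counts[key] += 1
    return counts
```

Fed the last 24h of log lines, the resulting counter separates "accepted by Microsoft" from "rejected by Gmail on 5.7.1" and makes the deferred 4.4.4 bucket visible as its own key.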

Bulk subdomain: send.performancewest.net (2026-06-19)

Why: isolate bulk/cold-campaign sending reputation from the root domain. The root domain carries transactional/verification/receipt mail (via co.carrierone.com relay + the .71 default egress) and must stay clean; cold campaigns are inherently reputation-risky. This is the industry-standard split (SendGrid, Mailchimp, etc.).

Customer experience is unchanged: From is the subdomain, but Reply-To stays info@performancewest.net, so replies land in the real inbox and look normal.

| Piece | Value |
| --- | --- |
| Trucking From | Performance West <noreply@send.performancewest.net> |
| Healthcare From | Performance West Compliance <compliance@send.performancewest.net> |
| Reply-To (both) | info@performancewest.net |
| DKIM | selector send (send._domainkey.send.performancewest.net), 2048-bit |
| SPF | v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all |
| DMARC | inherits root p=reject (explicit _dmarc.send also published) |
| MX / Return-Path | co.carrierone.com (bounces) |
| Egress IPs | .94 (trucking) / .107 (HC), unchanged |

Code: from_email is set in scripts/build_trucking_campaigns.py (FROM_EMAIL, env CAMPAIGN_FROM) and scripts/build_healthcare_campaigns_cron.py (FROM_EMAIL, env HC_CAMPAIGN_FROM). Bounce-watchers (scripts/bounce-watcher.sh, scripts/hc-bounce-watcher.sh) track the new subdomain sender (and keep the legacy root sender so the pre-cutover queue drains).
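
The builder scripts themselves are not reproduced here. As a rough sketch of the pattern described above (an env var overriding a default From, with Reply-To pinned to the root mailbox), something like the following would work. The helper names `resolve_from` and `build_headers` are hypothetical; only the env var names and addresses come from this runbook.

```python
import os
from email.message import EmailMessage

DEFAULT_FROM = "Performance West <noreply@send.performancewest.net>"
REPLY_TO = "info@performancewest.net"


def resolve_from(env_var="CAMPAIGN_FROM", default=DEFAULT_FROM, environ=None):
    """Return the From value from the environment, falling back to the default."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_var, "").strip()
    return value or default


def build_headers(subject, to_addr, from_value):
    """Build a message whose From is the bulk subdomain and Reply-To is the root inbox."""
    msg = EmailMessage()
    msg["From"] = from_value
    msg["To"] = to_addr
    msg["Reply-To"] = REPLY_TO
    msg["Subject"] = subject
    return msg
```

For the healthcare stream the same helper would be called with `env_var="HC_CAMPAIGN_FROM"` and the compliance address as the default.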

Infra: OpenDKIM signs both domains — see infra/ansible/roles/mail (opendkim_signing_domains list generates per-domain keys + KeyTable/SigningTable). DNS published on the Hestia master (see DNS automation note below). Verified end-to-end 2026-06-19: a test send signs d=send.performancewest.net; s=send; and egresses out05/.94.
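
To repeat the end-to-end check on a received test message, the DKIM-Signature header can be inspected for the expected `d=` and `s=` tags. The sketch below only parses tags; it does not verify the signature cryptographically. The function names are illustrative.

```python
def parse_dkim_tags(header_value: str) -> dict:
    """Split a DKIM-Signature header value into its tag=value pairs."""
    tags = {}
    for part in header_value.split(";"):
        if "=" in part:
            # Split on the first "=" only, since base64 values may end in "=".
            name, _, value = part.partition("=")
            # Remove folding whitespace inside the value.
            tags[name.strip()] = "".join(value.split())
    return tags


def signed_by(header_value: str, domain: str, selector: str) -> bool:
    """True if the header claims the given signing domain and selector."""
    tags = parse_dkim_tags(header_value)
    return tags.get("d", "").lower() == domain.lower() and tags.get("s", "") == selector
```

A bulk message sent after the cutover should satisfy `signed_by(header, "send.performancewest.net", "send")`; a message that does not is still being signed by the root domain key or is unsigned.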

Listmonk global app.from_email was also updated in both DBs as a fallback for any UI/test send that doesn't set From explicitly.

⚠️ The subdomain starts at NEUTRAL reputation (not negative, not warm). It still needs the same warm-up discipline: steady low volume to engaged recipients. It is NOT a magic reset — but it protects the root domain and starts cleaner than the damaged root.


Sending architecture (after 2026-06-18/19 consolidation)

| Stream | IP | PTR / HELO | Path |
| --- | --- | --- | --- |
| Trucking (listmonk) | 207.174.124.94 | mta05.performancewest.net | listmonk -> :25 -> randmap:{out05:} |
| Healthcare (listmonk-hc) | 207.174.124.107 | hcmta01.performancewest.net | listmonk-hc SMTP server 1 -> :2526 -> hcout1 |
| Transactional / verification | 207.174.124.71 + co.carrierone.com (.15) | perfwest | default smtp_bind_address (.71) + :587 relay (.15) |
| Yahoo/AOL trickle | 207.174.124.90 | mta01 | yahooslow transport (hash:transport) |
| Retired (torched May 30-31) | .91 / .92 / .93 | mta02-04 | rehab02-04; pw-ip-rehab cron DISABLED 2026-06-19 |
| Dormant (re-expand later) | .95-.105, .108-.109 | mta06-17, hcmta02-03 | disabled |

Root SPF (trimmed 2026-06-19): v=spf1 a mx ip4:207.174.124.15 ip4:207.174.124.94 ip4:207.174.124.107 -all

Here a = .71 and mx = co.carrierone.com (.15), plus the two bulk IPs. The old 21-IP record was a snowshoe signal; this matches carrierone.com's tight style.
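
A quick way to sanity-check which egress IPs the explicit `ip4:` terms authorize is sketched below. It deliberately ignores the `a` and `mx` mechanisms, which need DNS lookups, so .71 reports as not covered by `ip4:` even though the `a` mechanism authorizes it. Function names are illustrative.

```python
import ipaddress

ROOT_SPF = "v=spf1 a mx ip4:207.174.124.15 ip4:207.174.124.94 ip4:207.174.124.107 -all"
SEND_SPF = "v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all"


def spf_ip4_networks(record: str):
    """Return the networks named by ip4: mechanisms in an SPF record."""
    networks = []
    for term in record.split():
        if term.lower().startswith(("ip4:", "+ip4:")):
            networks.append(ipaddress.ip_network(term.split(":", 1)[1], strict=False))
    return networks


def ip_in_spf_ip4(ip: str, record: str) -> bool:
    """True if the IP falls inside any ip4: mechanism (a/mx are not resolved)."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in spf_ip4_networks(record))
```

Running this against both records before re-expanding is a cheap guard against adding an IP to one SPF record and forgetting the other.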

To re-expand after reputation is established: add transports back to ALL=() in infra/postfix/pw-mta-warmup.sh and re-enable the HC SMTP servers (ports 2527/2528) in the listmonk_hc DB settings.smtp. Re-expand SLOWLY (one IP at a time, days apart) and only after Postmaster Tools shows a green/medium reputation. If you re-expand, also add the IPs back to BOTH the root SPF and the send subdomain SPF.


DNS automation (Hestia is the master)

DNS is fully automatable — Hestia (cp.carrierone.com, 207.174.124.22) is the DNS master; HE.net are slaves. Access: ssh -p 22022 root@cp.carrierone.com using the local workstation's ~/.ssh/id_ed25519 (NOT the app server, NOT justin@ which is SFTP-only). The justin Hestia user owns the performancewest.net zone.

# add  (note: Hestia appends the base domain to the RECORD name, so a record at
#        send._domainkey.send.performancewest.net needs RECORD = "send._domainkey.send")
v-add-dns-record justin performancewest.net "<record>" <TYPE> "<value>" [prio]
# change / delete (find the numeric id with v-list-dns-records ... plain)
v-change-dns-record justin performancewest.net <id> "<record>" <TYPE> "<value>" "" yes <ttl>
v-delete-dns-record justin performancewest.net <id>
# list
v-list-dns-records  justin performancewest.net plain

Each write triggers a ~30s zone rebuild + DNSSEC re-sign; slaves sync via NOTIFY / SOA refresh, usually within a minute. Verify on @8.8.8.8 AND the master @207.174.124.22 (the master is authoritative; public resolvers may lag).
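
Because Hestia appends the base domain to the record name, the most common mistake is passing a fully qualified name. A small helper, sketched here with hypothetical function names, converts an FQDN into the relative name and assembles the argument list in the order shown in the command syntax above:

```python
def hestia_record_name(fqdn: str, zone: str) -> str:
    """Return the record name relative to the zone ('@' for the zone apex)."""
    fqdn = fqdn.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    if fqdn == zone:
        return "@"
    suffix = "." + zone
    if not fqdn.endswith(suffix):
        raise ValueError(f"{fqdn} is not inside zone {zone}")
    return fqdn[: -len(suffix)]


def add_record_argv(user: str, zone: str, fqdn: str, rtype: str, value: str):
    """Build the v-add-dns-record argument list (user, zone, record, type, value)."""
    return ["v-add-dns-record", user, zone, hestia_record_name(fqdn, zone), rtype, value]
```

Passing the list to a process runner as separate arguments, rather than joining it into one shell string, sidesteps the nested-quoting problem for TXT values.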


Monitoring tools (set these up to SEE reputation directly)

These all require a provider account login, so they can't be fully automated; the DNS TXT record Google needs can, however, be added on the Hestia master. Steps are pre-filled below.

🔴 MANUAL 1 — Google Postmaster Tools (Gmail is our biggest blocker)

Gmail's verbatim rejection names "the sending domain", so this is priority #1.

The verification TXT record is added on the Hestia DNS master (cp.carrierone.com; HE.net are slaves). Log in as root with ssh -p 22022 root@cp.carrierone.com, then run v-add-dns-record justin performancewest.net "@" TXT '"<value>"' in that shell. The '"'"' escaping form is only needed when the whole command is passed to ssh as one single-quoted argument. The zone owner is the justin Hestia user; expect a ~30s zone rebuild, after which the slaves sync via NOTIFY (or the 2h SOA refresh at worst), usually within a minute.

Status 2026-06-18: TXT added + verified live (record id 14464, google-site-verification=p8s3RaN5wi81350wToMpdPMho5Gcel4RGT1Q1SXj7vg), resolving on 8.8.8.8/1.1.1.1/9.9.9.9 and 4/5 HE.net slaves. Owner just needs to click Verify in the Postmaster console once. Data populates 24-48h after volume flows from the consolidated IP.

To set up from scratch next time: postmaster.google.com -> +Add domain -> performancewest.net -> copy the google-site-verification=... token -> add via the Hestia command above -> Verify.

🔴 MANUAL 2 — Microsoft SNDS + JMRP (Outlook/Hotmail/Live)

SNDS is IP-based (register the sending IPs), JMRP is the complaint feedback loop.

  1. SNDS: https://sendersupport.olc.protection.outlook.com/snds/ -> "Request access" -> register IPs: 207.174.124.94 and 207.174.124.107 (the two live stream IPs; add .90 and .71 if you want full coverage). Verification goes to a role address on the IP's domain — use postmaster@performancewest.net or abuse@performancewest.net (ensure one of those receives mail via carrierone).
  2. JMRP: https://sendersupport.olc.protection.outlook.com/pm/ -> sign in with a Microsoft account -> register the same IPs + a complaint-destination mailbox (e.g. fbl@performancewest.net). Complaints then arrive as ARF emails.

🔴 MANUAL 3 — Yahoo Complaint Feedback Loop (Yahoo/AOL + att/sbcglobal/verizon)

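  1. https://senders.yahooinc.com/complaint-feedback-loop/ -> sign in -> register the domain performancewest.net (CFL is DKIM-d= based, so it covers every IP that signs with that domain's mail._domainkey). Bulk mail now signs as d=send.performancewest.net with selector send, so that domain probably needs its own CFL registration as well.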
  2. Set the complaint destination to fbl@performancewest.net.

AUTOMATABLE LATER — DMARC aggregate reports (all providers, free)

Gmail/Yahoo/Microsoft already send daily per-IP auth+disposition XML to dmarc@performancewest.net (our DMARC record has rua=mailto:dmarc@...). Nobody parses them yet. If we add IMAP creds for that mailbox (it's on carrierone MX) we can build a small collector/parser worker to chart per-IP pass/fail without any provider login. Deferred — provider dashboards above are faster to stand up.
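
The parser half of that collector is small. Below is a minimal sketch, assuming the report has already been pulled from the mailbox and decompressed, and that it follows the usual aggregate-report layout (`record/row/source_ip`, `count`, `policy_evaluated/dkim|spf`) without XML namespaces. Reports that declare a namespace would need the tag lookups adjusted. The function name is illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_dmarc(xml_text: str) -> dict:
    """Sum message counts per source IP, split into DMARC pass and fail."""
    root = ET.fromstring(xml_text)
    summary = defaultdict(lambda: {"pass": 0, "fail": 0})
    for row in root.iter("row"):
        ip = row.findtext("source_ip")
        count = int(row.findtext("count", default="0"))
        dkim = row.findtext("policy_evaluated/dkim")
        spf = row.findtext("policy_evaluated/spf")
        # DMARC passes when either aligned DKIM or aligned SPF passes.
        bucket = "pass" if "pass" in (dkim, spf) else "fail"
        summary[ip][bucket] += count
    return dict(summary)
```

The per-IP totals can then be appended to whatever store the charting side uses; IMAP fetching and gzip/zip handling are intentionally left out of this sketch.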


Ongoing hygiene (reduce reputation damage)

  • Dead-address scrub: ~110 genuine 5.1.1 user unknown bounces/day. listmonk already blocklists hard bounces after 1 (bounce.actions hard->blocklist), so these self-clean, but pre-scrubbing the dirtiest segments before send avoids the reputation hit. See data/ segment exports.
  • Don't re-expand IPs until Postmaster Tools shows recovered reputation.
  • Volume discipline: keep the global 200/hr sliding window until reputation is green; concentrated low volume on one warm IP beats bursts.
  • Watch the rejection mix: 5.7.1 reputation/spam/blocked should fall over the next 1-2 weeks as the single-IP reputation builds. Track via: ssh ... 'sudo grep status=bounced /var/log/mail.log | grep -cF 5.7.1' (-F makes grep match the dots literally, so unrelated strings such as queue IDs are not counted).
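
For reference, the 200/hr sliding window in the volume-discipline item behaves like the limiter sketched below. This illustrates the technique only and is not how listmonk implements it; the class name and defaults are assumptions chosen to mirror the figure above.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` sends within any trailing `window_seconds` span."""

    def __init__(self, limit: int = 200, window_seconds: float = 3600):
        self.limit = limit
        self.window = window_seconds
        self.sent = deque()  # timestamps of sends still inside the window

    def allow(self, now: float) -> bool:
        """Record and permit a send at time `now` if the window has room."""
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            self.sent.append(now)
            return True
        return False
```

Unlike a fixed hourly counter that resets on the hour, a sliding window never permits two back-to-back bursts straddling the reset, which is the property that matters while reputation is still being rebuilt.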