justin 9dd6f53eb2 infra(mail): remove 18 dormant snowshoe IPs from postfix + host

Consolidate the outbound mail footprint to match the SPF intent (already
trimmed to .94/.107 on 2026-06-19). A 20-IP sending footprint reads as
snowshoe spam to receivers and was contributing to domain-reputation
throttling (Microsoft 451 4.7.500, Gmail low-reputation).

Removed from /etc/postfix/master.cf: transports yahooslow, out02-04,
out06-20, rehab02-04, HC submission ports 2527/2528, hcout2/hcout3.
Removed from /etc/network/interfaces (+ live ip addr del): host bindings
.90-.93, .95-.106, .108-.109. Kept: .94 (trucking/out05), .107 (HC/hcout1),
.71/.72 (infra).

Verified live: postfix check OK, both streams still status=sent post-change,
SSH session on .71 unaffected, transport_maps still routes via out05.

Snapshots: infra/postfix/live-snapshots/master.cf, infra/network/interfaces.
Live backups on server: /root/{master.cf,interfaces}.bak_snowshoe_*.

2026-06-23 23:45:41 -05:00

Email Deliverability Runbook

Owner action items are marked 🔴 MANUAL. Everything else is already done/automated.

Last updated: 2026-06-23 (snowshoe IP cleanup on postfix + host). Previous update: 2026-06-19 (bulk subdomain + SPF trim + Microsoft/audience analysis).

TL;DR of the 2026-06-18/19 deliverability incident

Symptom: ~30% "open" rates but 0 human clicks, 0 sales across both trucking and healthcare streams.
Root cause: NOT a blocklist, NOT the IPs. Proven by a controlled A/B test (2026-06-19): from the same mail server / same IPs, a message From justin@carrierone.com landed in the Inbox while From justin@performancewest.net went to Junk. The variable is the From domain's reputation. carrierone.com (reg. 2006, years of steady low-volume mail, tight 2-IP SPF) is trusted; performancewest.net (only started bulk in ~May 2026, broken DKIM until 2026-06-17, 21-IP snowshoe SPF, May 30-31 over-volume blast) is cold/damaged.
Where the audience actually is (24h receiver mix): ~85% Microsoft (M365/Outlook/Hotmail), ~14% Google, <1% Yahoo. Our list is B2B, so Microsoft is the game, not Gmail. Microsoft is NOT reputation-blocking us: only ~1.6% are 5.7.x/S3150 rejects, and it accepts ~2,138 msgs/24h. But acceptance != inbox, so the engagement problem there is likely Junk-foldering, with the same domain-reputation cause. Gmail rejects ~95% of its (smaller) slice with "550-5.7.1 ... very low reputation of the sending domain". The single biggest bounce bucket is actually list hygiene: ~1,012/24h Microsoft "451 4.4.4 no mail-enabled subscriptions" (dead tenant domains), plus dead recipients.
Fixes applied (2026-06-18/19):
1. Consolidated to ONE IP per stream (snowshoe was a band-aid for broken DKIM).
2. Dedicated bulk subdomain send.performancewest.net so bulk reputation is isolated from the root domain (which stays clean for transactional mail).
3. Trimmed root SPF from 21 IPs to the real 3 (the bloated record was itself a snowshoe signal).
4. Disabled the pointless pw-ip-rehab cron (we have no IP reputation problem).

Bulk subdomain: send.performancewest.net (2026-06-19)

Why: isolate bulk/cold-campaign sending reputation from the root domain. The root domain carries transactional/verification/receipt mail (via co.carrierone.com relay + the .71 default egress) and must stay clean; cold campaigns are inherently reputation-risky. Industry-standard (SendGrid/Mailchimp/etc.) split.

Customer experience is unchanged: From is the subdomain, but Reply-To stays info@performancewest.net, so replies land in the real inbox and look normal.

| Piece | Value |
| --- | --- |
| Trucking From | `Performance West <noreply@send.performancewest.net>` |
| Healthcare From | `Performance West Compliance <compliance@send.performancewest.net>` |
| Reply-To (both) | `info@performancewest.net` |
| DKIM selector | `send` (`send._domainkey.send.performancewest.net`), 2048-bit |
| SPF | `v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all` |
| DMARC | inherits root `p=reject` (explicit `_dmarc.send` also published) |
| MX / Return-Path | `co.carrierone.com` (bounces) |
| Egress IPs | .94 (trucking) / .107 (HC), unchanged |

Code: from_email is set in scripts/build_trucking_campaigns.py (FROM_EMAIL, env CAMPAIGN_FROM) and scripts/build_healthcare_campaigns_cron.py (FROM_EMAIL, env HC_CAMPAIGN_FROM). Bounce-watchers (scripts/bounce-watcher.sh, scripts/hc-bounce-watcher.sh) track the new subdomain sender (and keep the legacy root sender so the pre-cutover queue drains).
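
As an illustration of the From/Reply-To split, here is a minimal sketch of how a builder might assemble a message. It is not the code in scripts/build_trucking_campaigns.py: the helper names `build_campaign_message` and `from_domain` are hypothetical, and only the `CAMPAIGN_FROM` variable and the addresses come from this runbook.

```python
import os
from email.message import EmailMessage
from email.utils import parseaddr

# Values taken from the table above; the helpers below are hypothetical.
DEFAULT_FROM = "Performance West <noreply@send.performancewest.net>"
REPLY_TO = "info@performancewest.net"


def build_campaign_message(subject: str, body: str, to_addr: str) -> EmailMessage:
    """Build a message sent From the bulk subdomain with Reply-To on the root domain."""
    msg = EmailMessage()
    # CAMPAIGN_FROM overrides the default sender, as the runbook describes.
    msg["From"] = os.environ.get("CAMPAIGN_FROM", DEFAULT_FROM)
    msg["Reply-To"] = REPLY_TO
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def from_domain(msg: EmailMessage) -> str:
    """Return the lowercase domain of the From address (the domain DMARC aligns on)."""
    _, addr = parseaddr(str(msg["From"]))
    return addr.rpartition("@")[2].lower()
```

Checking `from_domain(msg)` before a send is a cheap guard against a campaign accidentally going out from the root domain.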

Infra: OpenDKIM signs both domains — see infra/ansible/roles/mail (opendkim_signing_domains list generates per-domain keys + KeyTable/SigningTable). DNS published on the Hestia master (see DNS automation note below). Verified end-to-end 2026-06-19: a test send signs d=send.performancewest.net; s=send; and egresses out05/.94.

Listmonk global app.from_email was also updated in both DBs as a fallback for any UI/test send that doesn't set From explicitly.

⚠️ The subdomain starts at NEUTRAL reputation (not negative, not warm). It still needs the same warm-up discipline: steady low volume to engaged recipients. It is NOT a magic reset — but it protects the root domain and starts cleaner than the damaged root.

Sending architecture (after 2026-06-18/19 consolidation)

| Stream | IP | PTR / HELO | Path |
| --- | --- | --- | --- |
| Trucking (listmonk) | 207.174.124.94 | mta05.performancewest.net | listmonk -> :25 -> `randmap:{out05:}` |
| Healthcare (listmonk-hc) | 207.174.124.107 | hcmta01.performancewest.net | listmonk-hc SMTP server 1 -> :2526 -> hcout1 |
| Transactional / verification | 207.174.124.71 + co.carrierone.com (.15) | perfwest | default `smtp_bind_address` (.71) + :587 relay (.15) |
| Removed 2026-06-23 (snowshoe cleanup) | .90-.93, .95-.106, .108-.109 | mta01-04/06-17, hcmta02-03 | transports + host IP bindings DELETED |

Snowshoe IP cleanup (2026-06-23): the 18 dormant sending IPs (.90-.93, .95-.106, .108-.109) were fully removed from BOTH postfix (master.cf transports yahooslow, out02-04, out06-20, rehab02-04, HC submission ports 2527/2528, hcout2/hcout3) AND the host (/etc/network/interfaces + live ip addr del). Only the two warm sending IPs (.94 trucking, .107 HC) plus infra (.71/.72) remain bound. A 20-IP footprint reads as snowshoe spam and was hurting domain reputation; the SPF was already trimmed to .94/.107 on 2026-06-19, so this just makes the host/postfix match the SPF intent. Verified live: postfix check OK, both streams still status=sent post-change, SSH unaffected. Reference snapshots are committed at infra/postfix/live-snapshots/master.cf and infra/network/interfaces (live backups: /root/master.cf.bak_snowshoe_* and /root/interfaces.bak_snowshoe_*).

Root SPF (trimmed 2026-06-19): v=spf1 a mx ip4:207.174.124.15 ip4:207.174.124.94 ip4:207.174.124.107 -all — a=.71, mx=co.carrierone.com(.15), plus the two bulk IPs. The old 21-IP record was a snowshoe signal; this matches carrierone.com's tight style.
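
To sanity-check which addresses the two SPF records list literally, a small sketch like the following can help. It only reads `ip4:` mechanisms from the record text and deliberately does not resolve `a`, `mx`, or `include`, so .71 (covered by `a`) will not appear. The function names are illustrative, not part of the repo.

```python
import ipaddress

# Record strings copied from this runbook.
ROOT_SPF = "v=spf1 a mx ip4:207.174.124.15 ip4:207.174.124.94 ip4:207.174.124.107 -all"
SEND_SPF = "v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all"


def spf_ip4_networks(record: str) -> list:
    """Return the networks named by literal ip4: mechanisms (no DNS lookups)."""
    networks = []
    for term in record.split()[1:]:      # skip the leading "v=spf1"
        term = term.lstrip("+")          # an explicit "+" qualifier means pass
        if term.lower().startswith("ip4:"):
            networks.append(ipaddress.ip_network(term[4:], strict=False))
    return networks


def ip4_listed(record: str, ip: str) -> bool:
    """True if ip falls inside any literal ip4: mechanism of the record."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in spf_ip4_networks(record))
```

For example, `ip4_listed(SEND_SPF, "207.174.124.90")` is false, which matches the intent that the removed snowshoe IPs are no longer authorized.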

To re-expand after reputation is established: add transports back to ALL=() in infra/postfix/pw-mta-warmup.sh and re-enable the HC SMTP servers (ports 2527/2528) in the listmonk_hc DB settings.smtp. Re-expand SLOWLY (one IP at a time, days apart) and only after Postmaster Tools shows a green/medium reputation. If you re-expand, also add the IPs back to BOTH the root SPF and the send subdomain SPF.

DNS automation (Hestia is the master)

DNS is fully automatable — Hestia (cp.carrierone.com, 207.174.124.22) is the DNS master; HE.net are slaves. Access: ssh -p 22022 root@cp.carrierone.com using the local workstation's ~/.ssh/id_ed25519 (NOT the app server, NOT justin@ which is SFTP-only). The justin Hestia user owns the performancewest.net zone.

# add  (note: Hestia appends the base domain to the RECORD name, so a record at
#        send._domainkey.send.performancewest.net needs RECORD = "send._domainkey.send")
v-add-dns-record justin performancewest.net "<record>" <TYPE> "<value>" [prio]
# change / delete (find the numeric id with v-list-dns-records ... plain)
v-change-dns-record justin performancewest.net <id> "<record>" <TYPE> "<value>" "" yes <ttl>
v-delete-dns-record justin performancewest.net <id>
# list
v-list-dns-records  justin performancewest.net plain

Each write triggers a ~30s zone rebuild + DNSSEC re-sign; slaves sync via NOTIFY / SOA refresh, usually within a minute. Verify on @8.8.8.8 AND the master @207.174.124.22 (the master is authoritative; public resolvers may lag).
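
Because Hestia appends the base domain to the RECORD argument, it is easy to pass a full name by mistake. The sketch below shows one way to derive the RECORD value from a fully qualified name. `hestia_record_name` is a hypothetical helper, and using "@" for the zone apex follows the Google verification command later in this runbook.

```python
def hestia_record_name(fqdn: str, zone: str = "performancewest.net") -> str:
    """Convert a fully qualified name into the RECORD argument for v-add-dns-record."""
    name = fqdn.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    if name == zone:
        return "@"                       # zone apex
    suffix = "." + zone
    if not name.endswith(suffix):
        raise ValueError(f"{fqdn!r} is not inside zone {zone!r}")
    return name[: -len(suffix)]          # strip the base domain Hestia will append
```

So `hestia_record_name("send._domainkey.send.performancewest.net")` gives `send._domainkey.send`, the value called out in the comment above.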

Monitoring tools (set these up to SEE reputation directly)

These all require a provider account login (plus, for Google, a verification TXT record, which is added on the Hestia master and synced to the HE.net slaves), so they can't be fully automated. Steps are pre-filled below.

🔴 MANUAL 1 — Google Postmaster Tools (Gmail is our biggest blocker)

Gmail's verbatim rejection names "the sending domain", so this is priority #1.

DNS is fully automatable: Hestia (cp.carrierone.com) is the DNS master and the HE.net servers are slaves. Add records as root: ssh -p 22022 root@cp.carrierone.com, then v-add-dns-record justin performancewest.net "@" TXT '"<value>"' (the zone owner is the justin Hestia user; expect a ~30s zone rebuild, after which slaves sync via NOTIFY or the 2h SOA refresh, usually within a minute).

Status 2026-06-18: TXT added + verified live (record id 14464, google-site-verification=p8s3RaN5wi81350wToMpdPMho5Gcel4RGT1Q1SXj7vg), resolving on 8.8.8.8/1.1.1.1/9.9.9.9 and 4/5 HE.net slaves. Owner just needs to click Verify in the Postmaster console once. Data populates 24-48h after volume flows from the consolidated IP.

To set up from scratch next time: postmaster.google.com -> +Add domain -> performancewest.net -> copy the google-site-verification=... token -> add via the Hestia command above -> Verify.

✅ MANUAL 2 — Microsoft SNDS + JMRP (Outlook/Hotmail/Live) — DONE 2026-06-19

85% of our audience is Microsoft-hosted (M365/Outlook/Hotmail), so this is the single most important monitoring tool. Microsoft already accepts our mail (~1.6% reputation rejects), so this tells us inbox-vs-junk + complaint rates. SNDS is IP-based (register the sending IPs), JMRP is the complaint feedback loop. Both SNDS access and JMRP are now registered for 207.174.124.94 + .107.

2026 URL MIGRATION: Microsoft moved SNDS off sendersupport.olc.protection.outlook.com. The old /snds/ and /pm/ links now 308-redirect to the new app at substrate.office.com/ip-domain-management-snds/. The footer/help links on that page ("contact sender support", "Privacy", "Microsoft Services Agreement") go to generic microsoft.com pages — that is normal, they are boilerplate, NOT the broken task. You must click "Log in" (top-right) with a personal Microsoft account FIRST; until you authenticate the "Request Access" / "Junk Mail Reporting Program" links just bounce to login.microsoftonline.com, which looks like a dead redirect but is the expected auth step. After login the real forms render.

SNDS Request Access: open the SNDS app, either via the legacy entry https://sendersupport.olc.protection.outlook.com/snds/ (it 308-redirects to the new app) or directly at https://substrate.office.com/ip-domain-management-snds/SNDS, then Log in -> left-nav "Request Access" (direct: https://substrate.office.com/ip-domain-management-snds/SNDS/AddNetwork) -> register IPs 207.174.124.94 and 207.174.124.107 (the two live stream IPs; add .71 if you want full coverage. .90 was unbound in the 2026-06-23 snowshoe cleanup and no longer needs registering). Verification goes to a role address on the IP's domain (use postmaster@ or abuse@performancewest.net, now live). (NOTE: snds.microsoft.com does NOT resolve; do not use it.) ✅ DONE 2026-06-19: access requested/granted for .94 + .107. Data populates over ~24-48h; then check the dashboard for the per-IP RED/YELLOW/GREEN status, spam-trap hits, and complaint rate.
JMRP: same site, left-nav "Junk Mail Reporting Program" (direct: https://substrate.office.com/ip-domain-management-snds/SNDS/Jmrp) -> register the same IPs + complaint-destination mailbox fbl@performancewest.net. Complaints then arrive as ARF emails. ✅ DONE 2026-06-19: both IPs registered as feeds — pw1 = 207.174.124.94, pw2 = 207.174.124.107, complaint destination set to fbl@performancewest.net (live, routes to ops@). ARF complaint reports now land there automatically.

✅ PREREQ DONE (2026-06-19): the role mailboxes Microsoft needs now exist and deliver. Created as Carbonio distribution lists routing to ops@performancewest.net: postmaster@, abuse@, fbl@, dmarc@ — all verified ACCEPT at the MX + delivered end-to-end. (They previously REJECTED with 5.1.1, which would have blocked SNDS verification.) Use postmaster@ or abuse@ for SNDS verification and fbl@performancewest.net as the JMRP complaint destination.

Carbonio mail admin: ssh -p 22022 justin@207.174.124.15 (the co.carrierone.com mail host; local workstation key, justin has NOPASSWD sudo). Run prov as zextras: sudo -u zextras /opt/zextras/bin/carbonio prov <cmd> (e.g. gaa, gadl, cdl <addr>, adlm <dl> <member>, gdlm <dl>).

✅ MANUAL 3 — Yahoo Complaint Feedback Loop — keys added 2026-06-19

Lowest priority (<1% of audience), but cheap. CFL is DKIM-d= based.

https://senders.yahooinc.com/complaint-feedback-loop/ -> sign in -> register the domains performancewest.net and send.performancewest.net (CFL keys off the DKIM d= value; bulk mail now signs d=send.performancewest.net).
Set the complaint destination to fbl@performancewest.net (now live, see above).

✅ ENROLLED 2026-06-19 — both domains show Enrolled in the Yahoo Sender Hub CFL with reporting email fbl@performancewest.net:

performancewest.net — Enrolled, reporting fbl@performancewest.net
send.performancewest.net — Enrolled, reporting fbl@performancewest.net (Reporting-email code was delivered to fbl@ → ops@ and verified; the Selector column is intentionally blank = match any DKIM selector on the verified domain.)

✅ DNS verification keys added + propagated 2026-06-19 (Hestia TXT, verified on all HE.net slaves + 8.8.8.8/1.1.1.1/9.9.9.9):

performancewest.net TXT yahoo-verification-key=IMx+OO5aKUE1nu9JwP6eSBMfSYZu8VcXjpkvEVXS84w=
send.performancewest.net TXT yahoo-verification-key=Ps5hGjVxXgeQcLcxr671YG0/RxzjjL0eqh6vfULubEo= (added alongside the existing send SPF record; both TXT coexist).

✅ DMARC aggregate reports — DONE 2026-06-19 (dedicated mailbox + parser)

Gmail/Yahoo/Microsoft + dozens of operators (Comcast, Cox, Bell, Mimecast, Cisco ESA, GMX, mail.com, gosecure, ...) send daily per-IP auth+disposition XML to dmarc@performancewest.net (DMARC record: p=reject; rua=mailto:dmarc@; ruf=mailto:dmarc@; fo=1). That mailbox was REJECTING (5.1.1) until 2026-06-19 — we silently lost every report. Now fully wired:

Dedicated mailbox. dmarc@performancewest.net is its own Carbonio account (was a DL -> ops@, which buried ops@ under report XML). Isolated IMAP credential in the server .env (DMARC_IMAP_{HOST,PORT,USER,PASS}), surfaced to the workers container in docker-compose.yml (mirrors the OPS_IMAP_* pattern). The 29 historical reports that had landed in ops@ were moved over via IMAP.
Parser worker. scripts/dmarc_report_parser.py IMAP-fetches unseen messages, decompresses the .gz/.zip/.xml attachment (namespace-agnostic — handles both the classic and the urn:ietf:params:xml:ns:dmarc-2.0 GMX/mail.com schema), parses the aggregate XML, and upserts one dmarc_report row (keyed (org_name, report_id), so re-parsing is a no-op) + one dmarc_record row per source IP into the schema from api/migrations/102_dmarc_aggregate.sql. dmarc_pass = dkim_aligned=pass OR spf_aligned=pass. Marks each message \Seen so each run only handles new reports. Flags: --dry-run, --all (backfill seen), --alert (7-day per-IP summary + Telegram if one of OUR IPs drops below 95% pass, or an EXTERNAL IP sends >=20 failing msgs as us = spoofing under p=reject).
Cron. /etc/cron.d/pw-dmarc-parser (tracked at infra/cron/pw-dmarc-parser) runs ... workers python3 -m scripts.dmarc_report_parser --alert daily at 06:20 UTC.
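
The namespace-agnostic parsing and the `dmarc_pass` rule can be sketched as follows. This is an illustration under stated assumptions, not the contents of scripts/dmarc_report_parser.py: the function names are hypothetical, it skips the IMAP and decompression steps, and it reads the aligned results from the standard `policy_evaluated` element of each record.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def local(tag: str) -> str:
    """Drop any '{namespace}' prefix so classic and dmarc-2.0 reports look alike."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, *path: str) -> Optional[str]:
    """Walk a path of local tag names below elem; return stripped text or None."""
    for name in path:
        elem = next((c for c in elem if local(c.tag) == name), None)
        if elem is None:
            return None
    return (elem.text or "").strip()


def parse_aggregate(xml_text: str) -> list:
    """Return one dict per <record>: source IP, message count, DMARC pass flag."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.iter():
        if local(rec.tag) != "record":
            continue
        dkim = child_text(rec, "row", "policy_evaluated", "dkim")
        spf = child_text(rec, "row", "policy_evaluated", "spf")
        rows.append({
            "source_ip": child_text(rec, "row", "source_ip"),
            "msg_count": int(child_text(rec, "row", "count") or 0),
            # DMARC passes when either aligned check passes.
            "dmarc_pass": dkim == "pass" or spf == "pass",
        })
    return rows
```

Matching on local tag names avoids maintaining a separate code path per schema, at the cost of not validating the namespace at all.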

Query examples once populated:

-- who sends as us, and are they aligning? (the payoff of the DKIM/subdomain fixes)
-- coalesce() makes an IP with zero passing messages show 0 instead of NULL
SELECT source_ip,
       sum(msg_count) AS total,
       coalesce(sum(msg_count) FILTER (WHERE dmarc_pass), 0) AS pass,
       round(100.0 * coalesce(sum(msg_count) FILTER (WHERE dmarc_pass), 0)
             / nullif(sum(msg_count), 0)) AS pass_pct
FROM dmarc_record r
JOIN dmarc_report rep ON rep.id = r.report_id
WHERE rep.date_begin >= now() - interval '7 days'
GROUP BY source_ip
ORDER BY total DESC;
-- any UNKNOWN IP failing alignment = spoofing/forgotten relay (reputation poison)
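
The two thresholds behind the parser's --alert flag (one of our IPs below 95% pass, or an external IP with 20 or more failing messages) can be expressed over the per-IP rows this query returns. The sketch is illustrative only: `classify_rows` and the alert labels are hypothetical, and treating just the two bulk IPs as "ours" is an assumption that may not match the real script.

```python
# Assumption: only the two bulk sending IPs count as "ours" here.
OUR_IPS = {"207.174.124.94", "207.174.124.107"}
MIN_PASS_PCT = 95.0      # alert when one of our IPs drops below this
SPOOF_FAIL_MIN = 20      # alert when an external IP fails at least this many msgs


def classify_rows(rows):
    """rows: iterable of (source_ip, total, passed) over the 7-day window."""
    alerts = []
    for ip, total, passed in rows:
        failed = total - passed
        if ip in OUR_IPS:
            if total and 100.0 * passed / total < MIN_PASS_PCT:
                alerts.append((ip, "own-ip-low-pass"))
        elif failed >= SPOOF_FAIL_MIN:
            alerts.append((ip, "external-failing"))
    return alerts
```

External IPs with only a handful of failures are ignored on purpose, since forwarders routinely break alignment at low volume.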

Ongoing hygiene (reduce reputation damage)

Dead-address scrub: ~110 genuine 5.1.1 user unknown bounces/day. listmonk already blocklists hard bounces after 1 (bounce.actions hard->blocklist), so these self-clean, but pre-scrubbing the dirtiest segments before send avoids the reputation hit. See data/ segment exports.
Consumer-domain exclusion (two layers). The authoritative list lives in scripts/_email_exclusions.py (BLOCKED_EMAIL_DOMAINS): gmail/google, the full Yahoo/Verizon-Media family, Microsoft consumer, Apple/iCloud (added 2026-06-19), dead/legacy ISPs, and the legal do-not-contact list.
1. NEW selections: the per-vertical builders filter it out of audience SQL and listmonk_import.py refuses to import a blocked address.
2. Already-imported subs: LIST-BASED campaigns (FCC Direct Contacts list 3, CRTC/USF blasts) can still hit consumer subs imported BEFORE a domain joined the list. scripts/scrub_listmonk_consumer.py reconciles the live subscriber table against the exclusion list and blocklists any ENABLED match (idempotent; --dry-run supported; both listmonk + listmonk_hc). Runs daily 06:30 UTC via /etc/cron.d/pw-listmonk-scrub (tracked at infra/cron/pw-listmonk-scrub). First run 2026-06-19 blocklisted 7,943 trucking + 21 HC stale consumer subs (1,321 iCloud, 267 gmail, etc.) that were leaking via the running CRTC campaign. Re-run the scrub whenever you add a domain to the exclusion list.
Don't re-expand IPs until Postmaster Tools shows recovered reputation.
Volume discipline: keep the global 200/hr sliding window until reputation is green; concentrated low volume on one warm IP beats bursts.
Watch the rejection mix: 5.7.1 reputation/spam/blocked bounces should fall over the next 1-2 weeks as the single-IP reputation builds. Track via: ssh ... 'sudo grep status=bounced /var/log/mail.log | grep -cF 5.7.1' (-F makes grep match the dots literally instead of as regex wildcards).
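
To see the whole rejection mix at once rather than a single code, a short tally over the mail log can help. This sketch assumes Postfix's usual delivery log fields (`dsn=X.Y.Z, status=bounced` or `status=deferred`); `tally_dsn` is a hypothetical helper, not an existing script in the repo.

```python
import re
from collections import Counter

# Matches the "dsn=5.7.1, status=bounced" pair found in Postfix delivery log lines.
DSN_RE = re.compile(r"\bdsn=(\d\.\d{1,3}\.\d{1,3}),\s*status=(bounced|deferred)\b")


def tally_dsn(lines):
    """Count (status, dsn) pairs across log lines; delivered mail is ignored."""
    counts = Counter()
    for line in lines:
        match = DSN_RE.search(line)
        if match:
            counts[(match.group(2), match.group(1))] += 1
    return counts
```

Feeding it the lines of /var/log/mail.log and printing `counts.most_common()` gives the buckets (5.7.1, 5.1.1, 4.4.4, and so on) side by side, which makes week-over-week comparison easier than separate grep counts.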
