justin c20edb28cd docs(deliverability): document Gmail re-enablement stale-Date/burial fix

When we resume Gmail sends, the front-loaded-inject + slow-drain pattern
buries mail: Listmonk stamps Date at injection (verified live: queued msg
Date matched postfix arrival, deferred 4h47m later), and Gmail sorts the
inbox by the Date header. So a msg injected at 08:00 but accepted at 14:00
files 6h down a Gmail inbox.

Documents: why NOT to future-date the Date header (spam signal + breaks our
DKIM which signs Date + doesn't help Outlook's received-time sort), and the
real fix -- pace Listmonk injection to match Gmail's accept rate (just-in-time
Date) via a dedicated Gmail stream on its own IP + low sliding-window rate +
queue-age guard. Outlook/M365 (current audience) sorts by received time so the
burial is cosmetic there and not worth fixing.

Procedure only; Gmail still excluded in _email_exclusions.py until re-enabled.

2026-06-24 01:24:24 -05:00


Email Deliverability Runbook

Owner action items are marked 🔴 MANUAL. Everything else is already done/automated.

Last updated: 2026-06-19 (bulk subdomain + SPF trim + Microsoft/audience analysis).

TL;DR of the 2026-06-18/19 deliverability incident

Symptom: ~30% "open" rates but 0 human clicks, 0 sales across both trucking and healthcare streams.
Root cause: NOT a blocklist, NOT the IPs. Proven by a controlled A/B test (2026-06-19): from the same mail server / same IPs, a message From justin@carrierone.com landed in the Inbox while From justin@performancewest.net went to Junk. The variable is the From domain's reputation. carrierone.com (reg. 2006, years of steady low-volume mail, tight 2-IP SPF) is trusted; performancewest.net (only started bulk in ~May 2026, broken DKIM until 2026-06-17, 21-IP snowshoe SPF, May 30-31 over-volume blast) is cold/damaged.
Where the audience actually is (24h receiver mix): ~85% Microsoft (M365/Outlook/Hotmail), ~14% Google, <1% Yahoo. Our list is B2B, so Microsoft is the game, not Gmail. Microsoft is NOT reputation-blocking us (only ~1.6% 5.7.x/S3150 rejects; it accepts ~2,138 msgs/24h) — but acceptance != inbox, so the engagement problem there is likely Junk-foldering, same domain-reputation cause. Gmail rejects ~95% of its (smaller) slice on 550-5.7.1 ... very low reputation of the sending domain. The single biggest bounce bucket is actually list hygiene: ~1,012/24h Microsoft 451 4.4.4 no mail-enabled subscriptions (dead tenant domains) + dead recipients.
Fixes applied (2026-06-18/19):
1. Consolidated to ONE IP per stream (snowshoe was a band-aid for broken DKIM).
2. Dedicated bulk subdomain send.performancewest.net so bulk reputation is isolated from the root domain (which stays clean for transactional mail).
3. Trimmed root SPF from 21 IPs to the real 3 (the bloated record was itself a snowshoe signal).
4. Disabled the pointless pw-ip-rehab cron (we have no IP reputation problem).

Bulk subdomain: send.performancewest.net (2026-06-19)

Why: isolate bulk/cold-campaign sending reputation from the root domain. The root domain carries transactional/verification/receipt mail (via co.carrierone.com relay + the .71 default egress) and must stay clean; cold campaigns are inherently reputation-risky. Industry-standard (SendGrid/Mailchimp/etc.) split.

Customer experience is unchanged: From is the subdomain, but Reply-To stays info@performancewest.net, so replies land in the real inbox and look normal.

| Piece | Value |
| --- | --- |
| Trucking From | `Performance West <noreply@send.performancewest.net>` |
| Healthcare From | `Performance West Compliance <compliance@send.performancewest.net>` |
| Reply-To (both) | `info@performancewest.net` |
| DKIM selector | `send` (`send._domainkey.send.performancewest.net`), 2048-bit |
| SPF | `v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all` |
| DMARC | inherits root `p=reject` (explicit `_dmarc.send` also published) |
| MX / Return-Path | `co.carrierone.com` (bounces) |
| Egress IPs | .94 (trucking) / .107 (HC), unchanged |

Code: from_email is set in scripts/build_trucking_campaigns.py (FROM_EMAIL, env CAMPAIGN_FROM) and scripts/build_healthcare_campaigns_cron.py (FROM_EMAIL, env HC_CAMPAIGN_FROM). Bounce-watchers (scripts/bounce-watcher.sh, scripts/hc-bounce-watcher.sh) track the new subdomain sender (and keep the legacy root sender so the pre-cutover queue drains).

Infra: OpenDKIM signs both domains — see infra/ansible/roles/mail (opendkim_signing_domains list generates per-domain keys + KeyTable/SigningTable). DNS published on the Hestia master (see DNS automation note below). Verified end-to-end 2026-06-19: a test send signs d=send.performancewest.net; s=send; and egresses out05/.94.

Listmonk global app.from_email was also updated in both DBs as a fallback for any UI/test send that doesn't set From explicitly.

⚠️ The subdomain starts at NEUTRAL reputation (not negative, not warm). It still needs the same warm-up discipline: steady low volume to engaged recipients. It is NOT a magic reset — but it protects the root domain and starts cleaner than the damaged root.

Sending architecture (after 2026-06-18/19 consolidation)

| Stream | IP | PTR / HELO | Path |
| --- | --- | --- | --- |
| Trucking (listmonk) | 207.174.124.94 | mta05.performancewest.net | listmonk -> :25 -> `randmap:{out05:}` |
| Healthcare (listmonk-hc) | 207.174.124.107 | hcmta01.performancewest.net | listmonk-hc SMTP server 1 -> :2526 -> hcout1 |
| Transactional / verification | 207.174.124.71 + co.carrierone.com (.15) | perfwest | default `smtp_bind_address` (.71) + :587 relay (.15) |
| Removed 2026-06-23 (snowshoe cleanup) | .90-.93, .95-.106, .108-.109 | mta01-04/06-17, hcmta02-03 | transports + host IP bindings DELETED |

Snowshoe IP cleanup (2026-06-23): the 18 dormant sending IPs (.90-.93, .95-.106, .108-.109) were fully removed from BOTH postfix (master.cf transports yahooslow/out02-04/out06-20/rehab02-04/2527/2528/hcout2/hcout3) AND the host (/etc/network/interfaces + live ip addr del). Only the two warm sending IPs (.94 trucking, .107 HC) plus infra (.71/.72) remain bound. A 20-IP footprint reads as snowshoe spam and was hurting domain reputation; the SPF was already trimmed to .94/.107 on 2026-06-19, so this just makes the host/postfix match the SPF intent. Verified live: postfix check OK, both streams still status=sent post-change, SSH unaffected. Reference snapshots committed at infra/postfix/live-snapshots/master.cf + infra/network/interfaces (live backups /root/master.cf.bak_snowshoe_* + /root/interfaces.bak_snowshoe_*).

Root SPF (trimmed 2026-06-19): v=spf1 a mx ip4:207.174.124.15 ip4:207.174.124.94 ip4:207.174.124.107 -all — a=.71, mx=co.carrierone.com(.15), plus the two bulk IPs. The old 21-IP record was a snowshoe signal; this matches carrierone.com's tight style.
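A minimal sketch for sanity-checking which IPs the trimmed records list explicitly. It only parses `ip4:` mechanisms from the two SPF strings quoted in this runbook; it does NOT resolve `a`, `mx`, or `include` (so .71, which the root record covers via `a`, is reported as not listed). The function names are hypothetical and not part of the repo.

```python
import ipaddress

# SPF strings copied from this runbook.
ROOT_SPF = "v=spf1 a mx ip4:207.174.124.15 ip4:207.174.124.94 ip4:207.174.124.107 -all"
SEND_SPF = "v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all"


def spf_ip4_networks(record):
    """Return the networks named by ip4: mechanisms (a/mx/include are not resolved)."""
    networks = []
    for term in record.split()[1:]:      # skip the leading "v=spf1"
        term = term.lstrip("+")          # "+" is the default (pass) qualifier
        if term.lower().startswith("ip4:"):
            networks.append(ipaddress.ip_network(term[4:], strict=False))
    return networks


def ip4_listed(record, ip):
    """True if `ip` falls inside any explicit ip4: mechanism of `record`."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in spf_ip4_networks(record))


if __name__ == "__main__":
    for ip in ("207.174.124.94", "207.174.124.107", "207.174.124.90"):
        print(ip, "send:", ip4_listed(SEND_SPF, ip), "root:", ip4_listed(ROOT_SPF, ip))
```

If IPs are ever re-expanded, running a check like this against both records is a cheap way to confirm the new IP was added to the root SPF and the send subdomain SPF alike.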

To re-expand after reputation is established: add transports back to ALL=() in infra/postfix/pw-mta-warmup.sh and re-enable the HC SMTP servers (ports 2527/2528) in the listmonk_hc DB settings.smtp. Re-expand SLOWLY (one IP at a time, days apart) and only after Postmaster Tools shows a green/medium reputation. If you re-expand, also add the IPs back to BOTH the root SPF and the send subdomain SPF.

Resuming Gmail sends: the stale-Date / inbox-burial problem (READ BEFORE re-enabling Gmail)

Status: Gmail is currently EXCLUDED from all sends (scripts/_email_exclusions.py BLOCKED_EMAIL_DOMAINS includes gmail/google). This section is the documented procedure for when we resume Gmail, and the reasoning for the chosen design. It is NOT yet implemented — implement it at the moment Gmail is re-enabled.
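For orientation, a domain-exclusion check of this kind can be sketched as below. This is NOT the code in scripts/_email_exclusions.py: the two-entry set is a hypothetical stand-in for BLOCKED_EMAIL_DOMAINS, and exact-match on the domain part is an assumption about how the real check behaves.

```python
# Hypothetical stand-in; the authoritative set is BLOCKED_EMAIL_DOMAINS
# in scripts/_email_exclusions.py.
SAMPLE_BLOCKED = frozenset({"gmail.com", "googlemail.com"})


def email_domain(address):
    """Lower-cased domain part of an address (text after the last '@')."""
    return address.rsplit("@", 1)[-1].strip().lower()


def is_blocked(address, blocked=SAMPLE_BLOCKED):
    """True if the address's domain is in the exclusion set (exact match)."""
    return email_domain(address) in blocked


if __name__ == "__main__":
    for addr in ("User@Gmail.com", "ops@performancewest.net"):
        print(addr, is_blocked(addr))
```

Re-enabling Gmail then amounts to removing the Gmail/Google entries from the real list, which is why the pacing design below must be in place first.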

The problem

We inject the whole daily batch into Postfix in a ~2.5h burst (today: 1,430 + 1,419 + 1,077 messages in the 07:00-09:30 window, with a 932-in-one-minute spike at 08:30), then Postfix slow-drains the queue over ~24h because receivers throttle a warming IP/domain (Microsoft 451 4.7.500 Server busy).

Listmonk stamps the Date: header at the moment it hands each message to Postfix (injection time), NOT at delivery time. Empirically verified 2026-06-23: a queued message had Date: 19:47:28 matching its Postfix arrival log line exactly, and was still deferred ~4h47m later. So a message injected at 08:00 keeps an 08:00 Date: even when the receiver finally accepts it at 14:00.

Why this matters ONLY for Gmail: inbox sort order depends on the client.

Outlook / Exchange / M365 (our current #1 audience, ~2,000 delivered/day) and most webmail (Proton, etc.) sort by received time (PR_MESSAGE_DELIVERY_TIME) = when THEIR server accepted it. A late-delivered message surfaces fresh at the top on arrival; only the displayed date looks old. So for today's audience the burial is cosmetic and NOT worth fixing.
Gmail sorts the inbox by the Date: header. A message accepted at 14:00 but Date-stamped 08:00 is filed 6h down the inbox, below mail the user has already read. That is real burial and real lost opens, and it only bites once we send to Gmail again (our B2B list is ~85% Microsoft / ~14% Google, so Gmail is a meaningful slice).
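The burial is simply the gap between the Date: header and the moment the receiver accepts the message. A minimal sketch of that arithmetic, using the 08:00 injected / 14:00 accepted example from above (the calendar date and the -0500 offset are illustrative assumptions):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def burial(date_header, accepted_at):
    """How far back a Date-sorted inbox files the message: accept time minus Date."""
    return accepted_at - parsedate_to_datetime(date_header)


if __name__ == "__main__":
    tz = timezone(timedelta(hours=-5))
    gap = burial(
        "Tue, 23 Jun 2026 08:00:00 -0500",           # stamped at injection
        datetime(2026, 6, 23, 14, 0, tzinfo=tz),     # accepted by the receiver
    )
    print(gap)
```

In a client that sorts by received time the same gap only affects the displayed date, not the position in the list.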

Why NOT to future-date / spoof the `Date:` header

The tempting "just stamp a future Date" fix is a net negative:

Spam signal. A Date: in the future is a classic filter heuristic — Proofpoint, Mimecast, and Microsoft all penalize it. We'd trade a cosmetic timestamp for WORSE inbox placement.
It breaks our DKIM. OpenDKIM signs the Date header (only From is over-signed, but Date is in the signed set). Rewriting Date after signing invalidates the signature -> DMARC p=reject -> hard bounce.
It doesn't even help Outlook (received-time sort) and is the wrong lever for Gmail (see the real fix below).
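To confirm the DKIM point on a real message, check whether `date` appears in the `h=` tag of its DKIM-Signature header. The sketch below parses that tag with the standard library only; the sample signature value is hypothetical (shaped after the `d=send.performancewest.net; s=send` signature described earlier, with `bh=`/`b=` elided), so inspect an actual sent message before relying on it.

```python
def signed_headers(dkim_signature_value):
    """Lower-cased header names listed in the h= tag of a DKIM-Signature value."""
    tags = {}
    for part in dkim_signature_value.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)   # split once: b=/bh= values may contain '='
            tags[name.strip()] = value
    return [h.strip().lower() for h in tags.get("h", "").split(":") if h.strip()]


def date_is_signed(dkim_signature_value):
    """True if rewriting Date after signing would invalidate this signature."""
    return "date" in signed_headers(dkim_signature_value)


# Hypothetical example value; From listed twice models over-signing.
SAMPLE_SIGNATURE = (
    "v=1; d=send.performancewest.net; s=send; "
    "h=From:From:Date:Subject:To:Reply-To; bh=<elided>; b=<elided>"
)

if __name__ == "__main__":
    print(signed_headers(SAMPLE_SIGNATURE))
    print("Date signed:", date_is_signed(SAMPLE_SIGNATURE))
```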

The fix: pace Listmonk INJECTION to match Gmail's accept rate (just-in-time Date)

Because Date: is stamped at injection, the solution is to release each Gmail message close to when Gmail will actually accept it, so Date: ≈ received time ≈ now, and it lands at the top of the Gmail inbox. Keep the Postfix queue shallow for the Gmail stream so no message sits for hours collecting a stale Date.
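To make the just-in-time idea concrete, the sketch below spreads a batch evenly at a fixed hourly rate and returns each message's injection time (and therefore its Date:). In production Listmonk's sliding-window limiter does the pacing; this only illustrates the arithmetic, and the batch size and start time are made-up inputs.

```python
from datetime import datetime, timedelta


def injection_times(start, count, per_hour):
    """Evenly spaced injection timestamps for `count` messages at `per_hour`."""
    interval = timedelta(hours=1) / per_hour
    return [start + i * interval for i in range(count)]


if __name__ == "__main__":
    # Hypothetical: 90 Gmail recipients, 30 messages/hour, starting 08:00.
    times = injection_times(datetime(2026, 6, 23, 8, 0), 90, 30)
    print("first:", times[0], "last:", times[-1])
```

At 30/hr the 90th message is injected just under three hours after the first, each one carrying a Date: close to when it is handed over, instead of all 90 sharing an 08:00-ish stamp and waiting in the queue.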

Implementation when re-enabling Gmail:

Segment Gmail into its OWN Listmonk campaign on its OWN single IP (snowshoe-safe), separate from the Microsoft/Proofpoint stream, so its deliberately slow pace does not bottleneck the fast stream. Each stream gets its own injection cadence. (Add the new IP to host + Postfix transport + BOTH SPF records first, per the re-expand note above.)
Set the Gmail campaign's sliding-window injection rate at or below Gmail's sustained cold-domain accept rate (app.message_sliding_window_rate / _duration on that Listmonk instance). Start low (~20-30/hr/IP for a cold domain) and ramp as Postmaster Tools reputation climbs. This spreads injection across the whole sending window instead of front-loading it, so the queue never builds a backlog of stale-dated Gmail mail.
Queue-age guard. Monitor the inject->deliver gap for the Gmail stream (delay= in the maillog). If it exceeds ~30 min, injection is outrunning acceptance -> throttle the sliding-window rate down further. Verify after a day that the Gmail stream's delay= stays small and the "6-24h late" bucket is ~0.
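The queue-age guard can be sketched as a small log check. This assumes the standard Postfix delivery log line (`..., delay=<seconds>, delays=..., dsn=..., status=sent ...`) and that the caller has already filtered the lines down to the Gmail stream; using the worst observed delay as the trigger is also an assumption, since the threshold above is only stated as "~30 min". Function names are hypothetical.

```python
import re

# Matches "delay=123" or "delay=0.42" but not the "delays=a/b/c/d" field.
DELAY_RE = re.compile(r"\bdelay=(\d+(?:\.\d+)?),")
MAX_GAP_SECONDS = 30 * 60  # the ~30 min guard from this runbook


def delivered_delays(log_lines):
    """Yield the inject->deliver gap (seconds) of every status=sent line."""
    for line in log_lines:
        if "status=sent" not in line:
            continue
        match = DELAY_RE.search(line)
        if match:
            yield float(match.group(1))


def should_throttle(log_lines, max_gap=MAX_GAP_SECONDS):
    """True if any delivered message waited longer than max_gap seconds."""
    delays = list(delivered_delays(log_lines))
    return bool(delays) and max(delays) > max_gap


if __name__ == "__main__":
    import sys
    # Example: grep the Gmail transport's lines out of the maillog, pipe them in.
    print("throttle down" if should_throttle(sys.stdin) else "ok")
```

A message deferred for 4h47m shows up as `delay=17220`, far past the 1800-second guard, which is the signal to lower the sliding-window rate.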

This is strictly better than date-spoofing: no spam signal, no DKIM break, and because Gmail/Microsoft both reward steady paced volume, pacing injection also RAISES the accept quota over time (the deliverability principle "concentrated low volume beats bursts"). Win-win.

Note: this same pacing slightly helps Outlook's displayed date too, but since Outlook sorts by received time it is not necessary there. Only spend the effort on the Gmail stream.

DNS automation (Hestia is the master)

DNS is fully automatable — Hestia (cp.carrierone.com, 207.174.124.22) is the DNS master; HE.net are slaves. Access: ssh -p 22022 root@cp.carrierone.com using the local workstation's ~/.ssh/id_ed25519 (NOT the app server, NOT justin@ which is SFTP-only). The justin Hestia user owns the performancewest.net zone.

# add  (note: Hestia appends the base domain to the RECORD name, so a record at
#        send._domainkey.send.performancewest.net needs RECORD = "send._domainkey.send")
v-add-dns-record justin performancewest.net "<record>" <TYPE> "<value>" [prio]
# change / delete (find the numeric id with v-list-dns-records ... plain)
v-change-dns-record justin performancewest.net <id> "<record>" <TYPE> "<value>" "" yes <ttl>
v-delete-dns-record justin performancewest.net <id>
# list
v-list-dns-records  justin performancewest.net plain

Each write triggers a ~30s zone rebuild + DNSSEC re-sign; slaves sync via NOTIFY / SOA refresh, usually within a minute. Verify on @8.8.8.8 AND the master @207.174.124.22 (the master is authoritative; public resolvers may lag).
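Because Hestia appends the zone to the RECORD argument, scripted callers need to strip the zone from the full name first. A minimal sketch of that conversion (the helper name is hypothetical; `@` for the zone apex follows the usage elsewhere in this runbook):

```python
def hestia_record_name(fqdn, zone):
    """Convert a full DNS name into the RECORD argument for v-add-dns-record."""
    fqdn = fqdn.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    if fqdn == zone:
        return "@"                       # zone apex
    suffix = "." + zone
    if not fqdn.endswith(suffix):
        raise ValueError(f"{fqdn!r} is not inside zone {zone!r}")
    return fqdn[: -len(suffix)]


if __name__ == "__main__":
    zone = "performancewest.net"
    for name in ("send._domainkey.send.performancewest.net",
                 "_dmarc.send.performancewest.net",
                 "performancewest.net"):
        print(name, "->", hestia_record_name(name, zone))
```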

Monitoring tools (set these up to SEE reputation directly)

These all require a provider account login + (for Google) a DNS TXT record (added on the Hestia master; the HE.net slaves pick it up), so they can't be fully automated. Steps are pre-filled below.

🔴 MANUAL 1 — Google Postmaster Tools (Gmail is our biggest blocker)

Gmail's verbatim rejection names "the sending domain", so this is priority #1.

DNS is fully automatable: Hestia (cp.carrierone.com) is the DNS master and HE.net are slaves. Add records as root: ssh -p 22022 root@cp.carrierone.com, then v-add-dns-record justin performancewest.net "@" TXT '"<value>"' (if you instead pass the whole command to ssh inside single quotes, each inner ' must be escaped as '"'"'). The zone owner is the justin Hestia user; expect a ~30s zone rebuild, then slaves sync via NOTIFY / the 2h SOA refresh, usually within a minute.

Status 2026-06-18: TXT added + verified live (record id 14464, google-site-verification=p8s3RaN5wi81350wToMpdPMho5Gcel4RGT1Q1SXj7vg), resolving on 8.8.8.8/1.1.1.1/9.9.9.9 and 4/5 HE.net slaves. Owner just needs to click Verify in the Postmaster console once. Data populates 24-48h after volume flows from the consolidated IP.

To set up from scratch next time: postmaster.google.com -> +Add domain -> performancewest.net -> copy the google-site-verification=... token -> add via the Hestia command above -> Verify.

✅ MANUAL 2 — Microsoft SNDS + JMRP (Outlook/Hotmail/Live) — DONE 2026-06-19

85% of our audience is Microsoft-hosted (M365/Outlook/Hotmail), so this is the single most important monitoring tool. Microsoft already accepts our mail (~1.6% reputation rejects), so this tells us inbox-vs-junk + complaint rates. SNDS is IP-based (register the sending IPs), JMRP is the complaint feedback loop. Both SNDS access and JMRP are now registered for 207.174.124.94 + .107.

2026 URL MIGRATION: Microsoft moved SNDS off sendersupport.olc.protection.outlook.com. The old /snds/ and /pm/ links now 308-redirect to the new app at substrate.office.com/ip-domain-management-snds/. The footer/help links on that page ("contact sender support", "Privacy", "Microsoft Services Agreement") go to generic microsoft.com pages — that is normal, they are boilerplate, NOT the broken task. You must click "Log in" (top-right) with a personal Microsoft account FIRST; until you authenticate the "Request Access" / "Junk Mail Reporting Program" links just bounce to login.microsoftonline.com, which looks like a dead redirect but is the expected auth step. After login the real forms render.

SNDS — Request Access: open the SNDS app — either the legacy entry https://sendersupport.olc.protection.outlook.com/snds/ (it 308-redirects to the new app) or directly https://substrate.office.com/ip-domain-management-snds/SNDS — then Log in -> left-nav "Request Access" (direct: https://substrate.office.com/ip-domain-management-snds/SNDS/AddNetwork) -> register IPs 207.174.124.94 and 207.174.124.107 (the two live stream IPs; add .90 and .71 if you want full coverage). Verification goes to a role address on the IP's domain (use postmaster@ or abuse@performancewest.net, now live). (NOTE: snds.microsoft.com does NOT resolve — do not use it.) ✅ DONE 2026-06-19: access requested/granted for .94 + .107. Data populates over ~24-48h; then check the dashboard for the per-IP RED/YELLOW/GREEN status, spam-trap hits, and complaint rate.
JMRP: same site, left-nav "Junk Mail Reporting Program" (direct: https://substrate.office.com/ip-domain-management-snds/SNDS/Jmrp) -> register the same IPs + complaint-destination mailbox fbl@performancewest.net. Complaints then arrive as ARF emails. ✅ DONE 2026-06-19: both IPs registered as feeds — pw1 = 207.174.124.94, pw2 = 207.174.124.107, complaint destination set to fbl@performancewest.net (live, routes to ops@). ARF complaint reports now land there automatically.

✅ PREREQ DONE (2026-06-19): the role mailboxes Microsoft needs now exist and deliver. Created as Carbonio distribution lists routing to ops@performancewest.net: postmaster@, abuse@, fbl@, dmarc@ — all verified ACCEPT at the MX + delivered end-to-end. (They previously REJECTED with 5.1.1, which would have blocked SNDS verification.) Use postmaster@ or abuse@ for SNDS verification and fbl@performancewest.net as the JMRP complaint destination.

Carbonio mail admin: ssh -p 22022 justin@207.174.124.15 (the co.carrierone.com mail host; local workstation key, justin has NOPASSWD sudo). Run prov as zextras: sudo -u zextras /opt/zextras/bin/carbonio prov <cmd> (e.g. gaa, gadl, cdl <addr>, adlm <dl> <member>, gdlm <dl>).

✅ MANUAL 3 — Yahoo Complaint Feedback Loop — keys added 2026-06-19

Lowest priority (<1% of audience), but cheap. CFL is DKIM-d= based.

https://senders.yahooinc.com/complaint-feedback-loop/ -> sign in -> register the domains performancewest.net and send.performancewest.net (CFL keys off the DKIM d= value; bulk mail now signs d=send.performancewest.net).
Set the complaint destination to fbl@performancewest.net (now live, see above).

✅ ENROLLED 2026-06-19 — both domains show Enrolled in the Yahoo Sender Hub CFL with reporting email fbl@performancewest.net:

performancewest.net — Enrolled, reporting fbl@performancewest.net
send.performancewest.net — Enrolled, reporting fbl@performancewest.net (Reporting-email code was delivered to fbl@ → ops@ and verified; the Selector column is intentionally blank = match any DKIM selector on the verified domain.)

✅ DNS verification keys added + propagated 2026-06-19 (Hestia TXT, verified on all HE.net slaves + 8.8.8.8/1.1.1.1/9.9.9.9):

performancewest.net TXT yahoo-verification-key=IMx+OO5aKUE1nu9JwP6eSBMfSYZu8VcXjpkvEVXS84w=
send.performancewest.net TXT yahoo-verification-key=Ps5hGjVxXgeQcLcxr671YG0/RxzjjL0eqh6vfULubEo= (added alongside the existing send SPF record; both TXT coexist).

✅ DMARC aggregate reports — DONE 2026-06-19 (dedicated mailbox + parser)

Gmail/Yahoo/Microsoft + dozens of operators (Comcast, Cox, Bell, Mimecast, Cisco ESA, GMX, mail.com, gosecure, ...) send daily per-IP auth+disposition XML to dmarc@performancewest.net (DMARC record: p=reject; rua=mailto:dmarc@; ruf=mailto:dmarc@; fo=1). That mailbox was REJECTING (5.1.1) until 2026-06-19 — we silently lost every report. Now fully wired:

Dedicated mailbox. dmarc@performancewest.net is its own Carbonio account (was a DL -> ops@, which buried ops@ under report XML). Isolated IMAP credential in the server .env (DMARC_IMAP_{HOST,PORT,USER,PASS}), surfaced to the workers container in docker-compose.yml (mirrors the OPS_IMAP_* pattern). The 29 historical reports that had landed in ops@ were moved over via IMAP.
Parser worker. scripts/dmarc_report_parser.py IMAP-fetches unseen messages, decompresses the .gz/.zip/.xml attachment (namespace-agnostic — handles both the classic and the urn:ietf:params:xml:ns:dmarc-2.0 GMX/mail.com schema), parses the aggregate XML, and upserts one dmarc_report row (keyed (org_name, report_id), so re-parsing is a no-op) + one dmarc_record row per source IP into the schema from api/migrations/102_dmarc_aggregate.sql. dmarc_pass = dkim_aligned=pass OR spf_aligned=pass. Marks each message \Seen so each run only handles new reports. Flags: --dry-run, --all (backfill seen), --alert (7-day per-IP summary + Telegram if one of OUR IPs drops below 95% pass, or an EXTERNAL IP sends >=20 failing msgs as us = spoofing under p=reject).
Cron. /etc/cron.d/pw-dmarc-parser (tracked at infra/cron/pw-dmarc-parser) runs ... workers python3 -m scripts.dmarc_report_parser --alert daily at 06:20 UTC.
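The pass rule and the two `--alert` thresholds described above can be sketched as follows. This is an illustration of the stated rules, not the code in scripts/dmarc_report_parser.py: the record tuple shape, the function names, and treating only .94/.107 as "our" IPs are assumptions (the real script may also count .71/.15), and the 7-day windowing and Telegram delivery are left out.

```python
from collections import defaultdict

OUR_IPS = {"207.174.124.94", "207.174.124.107"}  # assumption: bulk IPs only
MIN_PASS_PCT = 95.0        # alert if one of our IPs drops below this
SPOOF_MIN_FAILS = 20       # alert if an external IP sends this many failures


def dmarc_pass(dkim_aligned, spf_aligned):
    """DMARC passes when either aligned check passes."""
    return dkim_aligned == "pass" or spf_aligned == "pass"


def alerts(records):
    """records: iterable of (source_ip, msg_count, dkim_aligned, spf_aligned)."""
    total = defaultdict(int)
    passed = defaultdict(int)
    for ip, count, dkim, spf in records:
        total[ip] += count
        if dmarc_pass(dkim, spf):
            passed[ip] += count

    out = []
    for ip in sorted(total):
        if total[ip] == 0:
            continue
        failed = total[ip] - passed[ip]
        pct = 100.0 * passed[ip] / total[ip]
        if ip in OUR_IPS and pct < MIN_PASS_PCT:
            out.append(f"own IP {ip} at {pct:.1f}% DMARC pass")
        elif ip not in OUR_IPS and failed >= SPOOF_MIN_FAILS:
            out.append(f"external IP {ip} sent {failed} failing messages")
    return out


if __name__ == "__main__":
    sample = [
        ("207.174.124.94", 90, "pass", "fail"),
        ("207.174.124.94", 10, "fail", "fail"),
        ("198.51.100.7", 25, "fail", "fail"),   # documentation-range IP
    ]
    for line in alerts(sample):
        print(line)
```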

Query examples once populated:

-- who sends as us, and are they aligning? (the payoff of the DKIM/subdomain fixes)
SELECT source_ip, sum(msg_count) total,
       sum(msg_count) FILTER (WHERE dmarc_pass) pass,
       round(100.0*sum(msg_count) FILTER (WHERE dmarc_pass)/sum(msg_count)) pass_pct
FROM dmarc_record r JOIN dmarc_report rep ON rep.id=r.report_id
WHERE rep.date_begin >= now()-interval '7 days'
GROUP BY source_ip ORDER BY total DESC;
-- any UNKNOWN IP failing alignment = spoofing/forgotten relay (reputation poison)

Ongoing hygiene (reduce reputation damage)

Dead-address scrub: ~110 genuine 5.1.1 user unknown bounces/day. listmonk already blocklists hard bounces after 1 (bounce.actions hard->blocklist), so these self-clean, but pre-scrubbing the dirtiest segments before send avoids the reputation hit. See data/ segment exports.
Consumer-domain exclusion (two layers). The authoritative list lives in scripts/_email_exclusions.py (BLOCKED_EMAIL_DOMAINS): gmail/google, the full Yahoo/Verizon-Media family, Microsoft consumer, Apple/iCloud (added 2026-06-19), dead/legacy ISPs, and the legal do-not-contact list.
1. NEW selections: the per-vertical builders filter it out of audience SQL and listmonk_import.py refuses to import a blocked address.
2. Already-imported subs: LIST-BASED campaigns (FCC Direct Contacts list 3, CRTC/USF blasts) can still hit consumer subs imported BEFORE a domain joined the list. scripts/scrub_listmonk_consumer.py reconciles the live subscriber table against the exclusion list and blocklists any ENABLED match (idempotent; --dry-run supported; both listmonk + listmonk_hc). Runs daily 06:30 UTC via /etc/cron.d/pw-listmonk-scrub (tracked at infra/cron/pw-listmonk-scrub). First run 2026-06-19 blocklisted 7,943 trucking + 21 HC stale consumer subs (1,321 iCloud, 267 gmail, etc.) that were leaking via the running CRTC campaign. Re-run the scrub whenever you add a domain to the exclusion list.
Don't re-expand IPs until Postmaster Tools shows recovered reputation.
Volume discipline: keep the global 200/hr sliding window until reputation is green; concentrated low volume on one warm IP beats bursts.
Watch the rejection mix: 5.7.1 reputation/spam/blocked should fall over the next 1-2 weeks as the single-IP reputation builds. Track via: ssh ... 'sudo grep status=bounced /var/log/mail.log | grep -c 5.7.1'
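For a fuller view of the rejection mix than a single grep count, a sketch like the one below tallies bounced lines by enhanced status code. It assumes the standard Postfix `dsn=X.Y.Z, status=bounced` log fields and matches the code literally (in the grep pattern above the dots match any character). The function name is hypothetical.

```python
import re
from collections import Counter

DSN_RE = re.compile(r"\bdsn=(\d\.\d{1,3}\.\d{1,3}),")


def bounce_mix(log_lines):
    """Count status=bounced lines per enhanced status code (e.g. 5.7.1, 5.1.1)."""
    counts = Counter()
    for line in log_lines:
        if "status=bounced" not in line:
            continue
        match = DSN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: feed it the mail log on stdin.
    for code, n in bounce_mix(sys.stdin).most_common():
        print(f"{code}\t{n}")
```

Tracked daily, the 5.7.1 share should shrink as reputation builds, while 5.1.1 reflects list hygiene rather than reputation.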
