# Email Deliverability Runbook

**Owner action items are marked 🔴 MANUAL. Everything else is already done/automated.**

Last updated: 2026-06-23 (snowshoe IP cleanup + Gmail stale-Date procedure).
Previous: 2026-06-19 (bulk subdomain + SPF trim + Microsoft/audience analysis).

---

## TL;DR of the 2026-06-18/19 deliverability incident

- **Symptom:** ~30% "open" rates but **0 human clicks, 0 sales** across both trucking
  and healthcare streams.
- **Root cause:** NOT a blocklist, NOT the IPs. Proven by a controlled A/B test
  (2026-06-19): from the **same mail server / same IPs**, a message From
  `justin@carrierone.com` landed in the **Inbox** while From
  `justin@performancewest.net` went to **Junk**. The variable is the **From
  domain's reputation**. `carrierone.com` (reg. 2006, years of steady low-volume
  mail, tight 2-IP SPF) is trusted; `performancewest.net` (only started bulk in
  ~May 2026, broken DKIM until 2026-06-17, 21-IP snowshoe SPF, May 30-31
  over-volume blast) is cold/damaged.
- **Where the audience actually is (24h receiver mix):** **~85% Microsoft**
  (M365/Outlook/Hotmail), ~14% Google, <1% Yahoo. Our list is B2B, so Microsoft
  is the game, not Gmail. **Microsoft is NOT reputation-blocking us** (only ~1.6%
  5.7.x/S3150 rejects; it accepts ~2,138 msgs/24h) — but acceptance != inbox, so
  the engagement problem there is likely Junk-foldering, same domain-reputation
  cause. Gmail rejects ~95% of its (smaller) slice on `550-5.7.1 ... very low
  reputation of the sending domain`. The single biggest bounce bucket is actually
  **list hygiene**: ~1,012/24h Microsoft `451 4.4.4 no mail-enabled subscriptions`
  (dead tenant domains) + dead recipients.
- **Fixes applied (2026-06-18/19):**
  1. Consolidated to ONE IP per stream (snowshoe was a band-aid for broken DKIM).
  2. **Dedicated bulk subdomain** `send.performancewest.net` so bulk reputation is
     isolated from the root domain (which stays clean for transactional mail).
  3. Trimmed root SPF from 21 IPs to the real 3 (the bloated record was itself a
     snowshoe signal).
  4. Disabled the pointless `pw-ip-rehab` cron (we have no IP reputation problem).

---

## Bulk subdomain: send.performancewest.net (2026-06-19)

**Why:** isolate bulk/cold-campaign sending reputation from the root domain. The
root domain carries transactional/verification/receipt mail (via co.carrierone.com
relay + the .71 default egress) and must stay clean; cold campaigns are inherently
reputation-risky. Splitting bulk onto a subdomain is the industry-standard approach
(SendGrid, Mailchimp, etc.).

**Customer experience is unchanged:** From is the subdomain, but **Reply-To stays
`info@performancewest.net`**, so replies land in the real inbox and look normal.

| Piece | Value |
|-------|-------|
| Trucking From | `Performance West <noreply@send.performancewest.net>` |
| Healthcare From | `Performance West Compliance <compliance@send.performancewest.net>` |
| Reply-To (both) | `info@performancewest.net` |
| DKIM selector | `send` (`send._domainkey.send.performancewest.net`), 2048-bit |
| SPF | `v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all` |
| DMARC | inherits root `p=reject` (explicit `_dmarc.send` also published) |
| MX / Return-Path | `co.carrierone.com` (bounces) |
| Egress IPs | .94 (trucking) / .107 (HC) — unchanged |

**Code:** `from_email` is set in `scripts/build_trucking_campaigns.py` (`FROM_EMAIL`,
env `CAMPAIGN_FROM`) and `scripts/build_healthcare_campaigns_cron.py` (`FROM_EMAIL`,
env `HC_CAMPAIGN_FROM`). Bounce-watchers (`scripts/bounce-watcher.sh`,
`scripts/hc-bounce-watcher.sh`) track the new subdomain sender (and keep the legacy
root sender so the pre-cutover queue drains).

**Infra:** OpenDKIM signs both domains — see `infra/ansible/roles/mail`
(`opendkim_signing_domains` list generates per-domain keys + KeyTable/SigningTable).
DNS published on the Hestia master (see DNS automation note below). Verified
end-to-end 2026-06-19: a test send signs `d=send.performancewest.net; s=send;` and
egresses out05/.94.
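
To repeat that check on a captured test message, a minimal sketch like the one below can pull the `d=` and `s=` tags out of a `DKIM-Signature` header value. The function names and the sample header are hypothetical (the `bh=`/`b=` values are placeholders), and it only parses the tags; it does not verify the signature.

```python
import re


def parse_dkim_tags(header_value: str) -> dict:
    """Split a DKIM-Signature header value into a {tag: value} dict."""
    tags = {}
    for part in header_value.split(";"):
        name, sep, value = part.partition("=")
        if not sep:
            continue  # empty segment, e.g. after a trailing ';'
        # Folded headers carry CRLF + whitespace; drop it from the value.
        tags[name.strip()] = re.sub(r"\s+", "", value)
    return tags


def dkim_dns_name(tags: dict) -> str:
    """DNS name where the public key for this signature should be published."""
    return f"{tags['s']}._domainkey.{tags['d']}"


# Synthetic header value for illustration only (placeholder hashes).
sample = (
    "v=1; a=rsa-sha256; c=relaxed/relaxed; d=send.performancewest.net; s=send;\r\n"
    "\th=from:to:subject:date; bh=PLACEHOLDER=; b=PLACEHOLDER=="
)
tags = parse_dkim_tags(sample)
print(tags["d"], tags["s"], dkim_dns_name(tags))
```

For a correctly signed bulk message the derived DNS name should match the selector row in the table above.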

**Listmonk global `app.from_email`** was also updated in both DBs as a fallback for
any UI/test send that doesn't set From explicitly.

> ⚠️ The subdomain starts at NEUTRAL reputation (not negative, not warm). It still
> needs the same warm-up discipline: steady low volume to engaged recipients. It is
> NOT a magic reset — but it protects the root domain and starts cleaner than the
> damaged root.

---

## Sending architecture (after 2026-06-18/19 consolidation)

| Stream | IP | PTR / HELO | Path |
|--------|----|-----------|----|
| **Trucking** (listmonk) | **207.174.124.94** | mta05.performancewest.net | listmonk -> :25 -> `randmap:{out05:}` |
| **Healthcare** (listmonk-hc) | **207.174.124.107** | hcmta01.performancewest.net | listmonk-hc SMTP server 1 -> :2526 -> hcout1 |
| Transactional / verification | 207.174.124.71 + co.carrierone.com (.15) | perfwest | default `smtp_bind_address` (.71) + :587 relay (.15) |
| Removed 2026-06-23 (snowshoe cleanup) | .90-.93, .95-.106, .108-.109 | mta01-04/06-17, hcmta02-03 | transports + host IP bindings DELETED |

**Snowshoe IP cleanup (2026-06-23):** the 18 dormant sending IPs (.90-.93,
.95-.106, .108-.109) were fully removed from BOTH postfix (`master.cf`
transports `yahooslow`/`out02-04`/`out06-20`/`rehab02-04`/`2527`/`2528`/
`hcout2`/`hcout3`) AND the host (`/etc/network/interfaces` + live `ip addr del`).
Only the two warm sending IPs (.94 trucking, .107 HC) plus infra (.71/.72)
remain bound. A 20-IP footprint reads as snowshoe spam and was hurting domain
reputation; the SPF was already trimmed to .94/.107 on 2026-06-19, so this just
makes the host/postfix match the SPF intent. Verified live: `postfix check` OK,
both streams still `status=sent` post-change, SSH unaffected. Reference snapshots
committed at `infra/postfix/live-snapshots/master.cf` + `infra/network/interfaces`
(live backups `/root/master.cf.bak_snowshoe_*` + `/root/interfaces.bak_snowshoe_*`).

**Root SPF (trimmed 2026-06-19):** `v=spf1 a mx ip4:207.174.124.15
ip4:207.174.124.94 ip4:207.174.124.107 -all` — `a`=.71, `mx`=co.carrierone.com(.15),
plus the two bulk IPs. The old 21-IP record was a snowshoe signal; this matches
carrierone.com's tight style.
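
As a quick sanity check on either SPF record, the sketch below lists the explicit `ip4:` mechanisms and tests whether a given IP is covered. It is an illustration only: the helper names are made up, and it does not resolve the `a` / `mx` mechanisms or evaluate SPF the way a receiver does.

```python
import ipaddress


def spf_ip4_networks(record: str) -> list:
    """Return the networks named by explicit ip4: mechanisms in an SPF record."""
    nets = []
    for term in record.split():
        mech = term.lstrip("+-~?").lower()  # drop an optional qualifier
        if mech.startswith("ip4:"):
            nets.append(ipaddress.ip_network(mech[4:], strict=False))
    return nets


def ip_listed(record: str, ip: str) -> bool:
    """True if ip falls inside any explicit ip4: mechanism (a/mx are NOT resolved)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in spf_ip4_networks(record))


ROOT_SPF = "v=spf1 a mx ip4:207.174.124.15 ip4:207.174.124.94 ip4:207.174.124.107 -all"
SEND_SPF = "v=spf1 ip4:207.174.124.94 ip4:207.174.124.107 -all"

print(len(spf_ip4_networks(ROOT_SPF)))        # explicit ip4 entries in the root record
print(ip_listed(SEND_SPF, "207.174.124.94"))  # trucking egress IP
print(ip_listed(SEND_SPF, "207.174.124.90"))  # one of the removed snowshoe IPs
```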

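**To re-expand after reputation is established:** the 2026-06-23 cleanup deleted the
extra transports and host IP bindings, so those have to be restored first (see the
live backups from that cleanup, `/root/master.cf.bak_snowshoe_*` +
`/root/interfaces.bak_snowshoe_*`). Then add the transports back to `ALL=()`
in `infra/postfix/pw-mta-warmup.sh` and re-enable the HC SMTP servers (ports
2527/2528) in the `listmonk_hc` DB `settings.smtp`. Re-expand SLOWLY (one IP at a
time, days apart) and only after Postmaster Tools shows a green/medium reputation.
If you re-expand, also add the IPs back to BOTH the root SPF and the `send`
subdomain SPF.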

---

## Resuming Gmail sends: the stale-Date / inbox-burial problem (READ BEFORE re-enabling Gmail)

**Status:** Gmail is currently EXCLUDED from all sends (`scripts/_email_exclusions.py`
`BLOCKED_EMAIL_DOMAINS` includes gmail/google). This section is the documented
procedure for when we resume Gmail, and the reasoning for the chosen design. It is
NOT yet implemented — implement it at the moment Gmail is re-enabled.
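
For orientation, a domain-exclusion check of this kind can be sketched as below. This is not the contents of `scripts/_email_exclusions.py`: the set entries and the `is_blocked` helper are assumptions chosen to illustrate the idea.

```python
# Hypothetical stand-in; the authoritative list lives in scripts/_email_exclusions.py.
BLOCKED_EMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def is_blocked(address: str, blocked=BLOCKED_EMAIL_DOMAINS) -> bool:
    """True if the address's domain (case-insensitive) is on the exclusion list."""
    _, _, domain = address.strip().rpartition("@")
    return domain.lower() in blocked


print(is_blocked("Dispatcher@Gmail.com"))
print(is_blocked("ops@examplecarrier.com"))
```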

### The problem
We inject the whole daily batch into Postfix in a ~2.5h burst (today: 1,430 + 1,419
+ 1,077 messages in the 07:00-09:30 window, with a 932-in-one-minute spike at
08:30), then Postfix slow-drains the queue over ~24h because receivers throttle a
warming IP/domain (Microsoft `451 4.7.500 Server busy`).

**Listmonk stamps the `Date:` header at the moment it hands each message to Postfix
(injection time), NOT at delivery time.** Empirically verified 2026-06-23: a queued
message had `Date: 19:47:28` matching its Postfix arrival log line exactly, and was
still deferred ~4h47m later. So a message injected at 08:00 keeps an 08:00 `Date:`
even when the receiver finally accepts it at 14:00.

**Why this matters ONLY for Gmail:** inbox sort order depends on the client.
- **Outlook / Exchange / M365** (our current #1 audience, ~2,000 delivered/day) and
  most webmail (Proton, etc.) sort by **received time** (`PR_MESSAGE_DELIVERY_TIME`)
  = when THEIR server accepted it. A late-delivered message surfaces fresh at the
  top on arrival; only the *displayed* date looks old. So for today's audience the
  burial is cosmetic and NOT worth fixing.
- **Gmail sorts the inbox by the `Date:` header.** A message accepted at 14:00 but
  Date-stamped 08:00 is filed **6h down** the inbox, below mail the user has already
  read. That is real burial and real lost opens, and it only bites once we send to
  Gmail again. Our B2B list is ~85% Microsoft / ~14% Google, so Gmail is still a
  meaningful slice.
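
The size of the burial is just the gap between the `Date:` header and the time the receiver accepted the message. A minimal sketch of that arithmetic, using the 08:00 / 14:00 example above (the function name is hypothetical):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def date_header_lag(date_header: str, accepted_at: datetime) -> float:
    """Seconds between the message's Date: header and receiver acceptance."""
    stamped = parsedate_to_datetime(date_header)
    return (accepted_at - stamped).total_seconds()


lag = date_header_lag(
    "Tue, 23 Jun 2026 08:00:00 +0000",                   # stamped at injection
    datetime(2026, 6, 23, 14, 0, tzinfo=timezone.utc),   # accepted by the receiver
)
print(lag / 3600)  # 6.0 hours of burial in a Date-sorted inbox
```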

### Why NOT to future-date / spoof the `Date:` header
The tempting "just stamp a future Date" fix is a net negative:
1. **Spam signal.** A `Date:` in the future is a classic filter heuristic —
   Proofpoint, Mimecast, and Microsoft all penalize it. We'd trade a cosmetic
   timestamp for WORSE inbox placement.
2. **It breaks our DKIM.** OpenDKIM signs the `Date` header (only `From` is
   over-signed, but `Date` is in the signed set). Rewriting `Date` after signing
   invalidates the signature -> DMARC `p=reject` -> hard bounce.
3. **It doesn't even help Outlook** (received-time sort) and is the wrong lever for
   Gmail (see the real fix below).

### The fix: pace Listmonk INJECTION to match Gmail's accept rate (just-in-time Date)
Because `Date:` is stamped at injection, the solution is to **release each Gmail
message close to when Gmail will actually accept it**, so `Date:` ≈ received time ≈
now, and it lands at the top of the Gmail inbox. Keep the Postfix queue shallow for
the Gmail stream so no message sits for hours collecting a stale Date.

Implementation when re-enabling Gmail:
1. **Segment Gmail into its OWN Listmonk campaign on its OWN single IP** (snowshoe-
   safe), separate from the Microsoft/Proofpoint stream, so its deliberately slow
   pace does not bottleneck the fast stream. Each stream gets its own injection
   cadence. (Add the new IP to host + Postfix transport + BOTH SPF records first,
   per the re-expand note above.)
2. **Set the Gmail campaign's sliding-window injection rate at or below Gmail's
   sustained cold-domain accept rate** (`app.message_sliding_window_rate` /
   `_duration` on that Listmonk instance). Start low (~20-30/hr/IP for a cold
   domain) and ramp as Postmaster Tools reputation climbs. This spreads injection
   across the whole sending window instead of front-loading it, so the queue never
   builds a backlog of stale-dated Gmail mail.
3. **Queue-age guard.** Monitor the inject->deliver gap for the Gmail stream
   (`delay=` in the maillog). If it exceeds ~30 min, injection is outrunning
   acceptance -> throttle the sliding-window rate down further. Verify after a day
   that the Gmail stream's `delay=` stays small and the "6-24h late" bucket is ~0.

This is strictly better than date-spoofing: no spam signal, no DKIM break, and
because Gmail/Microsoft both reward steady paced volume, pacing injection also
RAISES the accept quota over time (the deliverability principle "concentrated low
volume beats bursts"). Win-win.

> Note: this same pacing slightly helps Outlook's *displayed* date too, but since
> Outlook sorts by received time it is not necessary there. Only spend the effort on
> the Gmail stream.

---

## DNS automation (Hestia is the master)

**DNS is fully automatable** — Hestia (`cp.carrierone.com`, 207.174.124.22) is the
DNS master; HE.net are slaves. Access: `ssh -p 22022 root@cp.carrierone.com` using
the **local workstation's** `~/.ssh/id_ed25519` (NOT the app server, NOT justin@
which is SFTP-only). The `justin` Hestia user owns the `performancewest.net` zone.

```
# Add a record. Hestia appends the base domain to the RECORD name, so a record at
# send._domainkey.send.performancewest.net needs RECORD = "send._domainkey.send".
v-add-dns-record justin performancewest.net "<record>" <TYPE> "<value>" [prio]
# Change / delete (find the numeric id with v-list-dns-records ... plain)
v-change-dns-record justin performancewest.net <id> "<record>" <TYPE> "<value>" "" yes <ttl>
v-delete-dns-record justin performancewest.net <id>
# List
v-list-dns-records justin performancewest.net plain
```

Each write triggers a ~30s zone rebuild + DNSSEC re-sign; slaves sync via NOTIFY /
SOA refresh, usually within a minute. Verify on `@8.8.8.8` AND the master
`@207.174.124.22` (the master is authoritative; public resolvers may lag).
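
Because Hestia appends the base domain to the RECORD name, it is easy to pass a full FQDN by mistake. A small helper like this (hypothetical, not part of the repo) derives the RECORD argument from the FQDN you actually want published:

```python
def hestia_record_name(fqdn: str, zone: str) -> str:
    """Return the RECORD argument for v-add-dns-record given the target FQDN."""
    fqdn = fqdn.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    if fqdn == zone:
        return "@"  # zone apex
    suffix = "." + zone
    if not fqdn.endswith(suffix):
        raise ValueError(f"{fqdn} is not inside zone {zone}")
    return fqdn[: -len(suffix)]


print(hestia_record_name("send._domainkey.send.performancewest.net", "performancewest.net"))
print(hestia_record_name("performancewest.net", "performancewest.net"))
```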

---

## Monitoring tools (set these up to SEE reputation directly)

These all require a provider account login, so they can't be fully automated. The
DNS TXT record Google needs is the exception: it is added on the Hestia master
(HE.net only slaves the zone). Steps are pre-filled below.

### 🔴 MANUAL 1 — Google Postmaster Tools (Gmail is our biggest blocker)
Gmail's verbatim rejection names "the sending **domain**", so this is priority #1.

**The DNS step is fully automatable:** Hestia (cp.carrierone.com) is the DNS master,
HE.net are slaves (see "DNS automation" above). Add records as root:
`ssh -p 22022 root@cp.carrierone.com`, then
`v-add-dns-record justin performancewest.net "@" TXT '"<value>"'`
(zone owner is the `justin` Hestia user; ~30s zone rebuild, then slaves sync via
NOTIFY / the 2h SOA refresh, usually within a minute).

Status 2026-06-18: **TXT added + verified live** (record id 14464,
`google-site-verification=p8s3RaN5wi81350wToMpdPMho5Gcel4RGT1Q1SXj7vg`),
resolving on 8.8.8.8/1.1.1.1/9.9.9.9 and 4/5 HE.net slaves. Owner just needs to
click **Verify** in the Postmaster console once. Data populates 24-48h after
volume flows from the consolidated IP.

To set up from scratch next time: postmaster.google.com -> +Add domain ->
performancewest.net -> copy the `google-site-verification=...` token -> add via
the Hestia command above -> Verify.

### ✅ MANUAL 2 — Microsoft SNDS + JMRP (Outlook/Hotmail/Live) — **DONE 2026-06-19**
**85% of our audience is Microsoft-hosted** (M365/Outlook/Hotmail), so this is the
single most important monitoring tool. Microsoft already *accepts* our mail (~1.6%
reputation rejects), so this tells us inbox-vs-junk + complaint rates.
SNDS is **IP-based** (register the sending IPs), JMRP is the complaint feedback loop.
**Both SNDS access and JMRP are now registered for 207.174.124.94 + .107.**

> **2026 URL MIGRATION:** Microsoft moved SNDS off
> `sendersupport.olc.protection.outlook.com`. The old `/snds/` and `/pm/` links now
> 308-redirect to the new app at **`substrate.office.com/ip-domain-management-snds/`**.
> The *footer/help* links on that page ("contact sender support", "Privacy",
> "Microsoft Services Agreement") go to generic `microsoft.com` pages — that is
> normal, they are boilerplate, NOT the broken task. **You must click "Log in"
> (top-right) with a personal Microsoft account FIRST**; until you authenticate the
> "Request Access" / "Junk Mail Reporting Program" links just bounce to
> `login.microsoftonline.com`, which looks like a dead redirect but is the expected
> auth step. After login the real forms render.

1. **SNDS — Request Access:** open the SNDS app — either the legacy entry
   <https://sendersupport.olc.protection.outlook.com/snds/> (it 308-redirects to the
   new app) or directly
   `https://substrate.office.com/ip-domain-management-snds/SNDS` — then **Log in** ->
   left-nav **"Request Access"** (direct:
   `https://substrate.office.com/ip-domain-management-snds/SNDS/AddNetwork`) ->
   register IPs **207.174.124.94** and **207.174.124.107** (the two live stream IPs;
   add .71 if you want full coverage; .90 was removed from the host on 2026-06-23).
   Verification goes to a role address
   on the IP's domain (use `postmaster@` or `abuse@performancewest.net`, now live).
   (NOTE: `snds.microsoft.com` does NOT resolve — do not use it.)
   **✅ DONE 2026-06-19:** access requested/granted for .94 + .107. Data populates
   over ~24-48h; then check the dashboard for the per-IP RED/YELLOW/GREEN status,
   spam-trap hits, and complaint rate.
2. **JMRP:** same site, left-nav **"Junk Mail Reporting Program"** (direct:
   `https://substrate.office.com/ip-domain-management-snds/SNDS/Jmrp`) -> register
   the same IPs + complaint-destination mailbox **`fbl@performancewest.net`**.
   Complaints then arrive as ARF emails.
   **✅ DONE 2026-06-19:** both IPs registered as feeds — `pw1` = 207.174.124.94,
   `pw2` = 207.174.124.107, complaint destination set to **`fbl@performancewest.net`**
   (live, routes to ops@). ARF complaint reports now land there automatically.

**✅ PREREQ DONE (2026-06-19):** the role mailboxes Microsoft needs now exist and
deliver. Created as Carbonio distribution lists routing to `ops@performancewest.net`:
`postmaster@`, `abuse@`, `fbl@`, `dmarc@` — all verified ACCEPT at the MX +
delivered end-to-end. (They previously REJECTED with 5.1.1, which would have blocked
SNDS verification.) Use `postmaster@` or `abuse@` for SNDS verification and
`fbl@performancewest.net` as the JMRP complaint destination.

> Carbonio mail admin: `ssh -p 22022 justin@207.174.124.15` (the **co.carrierone.com**
> mail host; local workstation key, justin has NOPASSWD sudo). Run prov as zextras:
> `sudo -u zextras /opt/zextras/bin/carbonio prov <cmd>` (e.g. `gaa`, `gadl`,
> `cdl <addr>`, `adlm <dl> <member>`, `gdlm <dl>`).

### ✅ MANUAL 3 — Yahoo Complaint Feedback Loop — **keys added 2026-06-19**
Lowest priority (<1% of audience), but cheap. CFL is DKIM-d= based.
1. <https://senders.yahooinc.com/complaint-feedback-loop/> -> sign in -> register
   the domains `performancewest.net` **and** `send.performancewest.net` (CFL keys
   off the DKIM `d=` value; bulk mail now signs `d=send.performancewest.net`).
2. Set the complaint destination to `fbl@performancewest.net` (now live, see above).

**✅ ENROLLED 2026-06-19** — both domains show **Enrolled** in the Yahoo Sender Hub
CFL with reporting email `fbl@performancewest.net`:
- `performancewest.net` — Enrolled, reporting `fbl@performancewest.net`
- `send.performancewest.net` — Enrolled, reporting `fbl@performancewest.net`
(Reporting-email code was delivered to fbl@ → ops@ and verified; the Selector
column is intentionally blank = match any DKIM selector on the verified domain.)

**✅ DNS verification keys added + propagated 2026-06-19** (Hestia TXT, verified on
all HE.net slaves + 8.8.8.8/1.1.1.1/9.9.9.9):
- `performancewest.net` TXT `yahoo-verification-key=IMx+OO5aKUE1nu9JwP6eSBMfSYZu8VcXjpkvEVXS84w=`
- `send.performancewest.net` TXT `yahoo-verification-key=Ps5hGjVxXgeQcLcxr671YG0/RxzjjL0eqh6vfULubEo=`
  (added alongside the existing `send` SPF record; both TXT coexist).

### ✅ DMARC aggregate reports — DONE 2026-06-19 (dedicated mailbox + parser)
Gmail/Yahoo/Microsoft + dozens of operators (Comcast, Cox, Bell, Mimecast, Cisco
ESA, GMX, mail.com, gosecure, ...) send daily per-IP auth+disposition XML to
`dmarc@performancewest.net` (DMARC record: `p=reject; rua=mailto:dmarc@; ruf=mailto:dmarc@; fo=1`).
**That mailbox was REJECTING (5.1.1) until 2026-06-19 — we silently lost every
report.** Now fully wired:

1. **Dedicated mailbox.** `dmarc@performancewest.net` is its own Carbonio account
   (was a DL -> ops@, which buried ops@ under report XML). Isolated IMAP credential
   in the server `.env` (`DMARC_IMAP_{HOST,PORT,USER,PASS}`), surfaced to the workers
   container in `docker-compose.yml` (mirrors the `OPS_IMAP_*` pattern). The 29
   historical reports that had landed in ops@ were moved over via IMAP.
2. **Parser worker.** `scripts/dmarc_report_parser.py` IMAP-fetches unseen messages,
   decompresses the `.gz`/`.zip`/`.xml` attachment (namespace-agnostic — handles both
   the classic and the `urn:ietf:params:xml:ns:dmarc-2.0` GMX/mail.com schema), parses
   the aggregate XML, and upserts one `dmarc_report` row (keyed `(org_name, report_id)`,
   so re-parsing is a no-op) + one `dmarc_record` row per source IP into the schema from
   `api/migrations/102_dmarc_aggregate.sql`. `dmarc_pass = dkim_aligned=pass OR
   spf_aligned=pass`. Marks each message `\Seen` so each run only handles new reports.
   Flags: `--dry-run`, `--all` (backfill seen), `--alert` (7-day per-IP summary +
   Telegram if one of OUR IPs drops below 95% pass, or an EXTERNAL IP sends >=20 failing
   msgs as us = spoofing under `p=reject`).
3. **Cron.** `/etc/cron.d/pw-dmarc-parser` (tracked at `infra/cron/pw-dmarc-parser`)
   runs `... workers python3 -m scripts.dmarc_report_parser --alert` daily at 06:20 UTC.
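
To make the namespace-agnostic parsing concrete, here is a minimal sketch of the idea. It is not the code in `scripts/dmarc_report_parser.py`. It assumes the aligned results are read from each record's `<policy_evaluated>` block (the standard aggregate-report layout), and the sample XML is synthetic.

```python
import xml.etree.ElementTree as ET


def _strip_ns(root):
    """Drop any '{namespace}' prefix so classic and dmarc-2.0 reports parse alike."""
    for el in root.iter():
        el.tag = el.tag.rsplit("}", 1)[-1]
    return root


def parse_records(xml_text: str) -> list:
    """Return one dict per <record>: source_ip, msg_count, dmarc_pass."""
    root = _strip_ns(ET.fromstring(xml_text))
    rows = []
    for rec in root.iter("record"):
        row = rec.find("row")
        if row is None:
            continue
        dkim = row.findtext("policy_evaluated/dkim", default="").strip().lower()
        spf = row.findtext("policy_evaluated/spf", default="").strip().lower()
        rows.append({
            "source_ip": row.findtext("source_ip", default="").strip(),
            "msg_count": int(row.findtext("count", default="0")),
            # DMARC passes if EITHER aligned DKIM or aligned SPF passes.
            "dmarc_pass": dkim == "pass" or spf == "pass",
        })
    return rows


# Synthetic report body for illustration only.
CLASSIC = """<feedback>
  <record><row>
    <source_ip>207.174.124.94</source_ip><count>12</count>
    <policy_evaluated><disposition>none</disposition><dkim>pass</dkim><spf>fail</spf></policy_evaluated>
  </row></record>
  <record><row>
    <source_ip>203.0.113.9</source_ip><count>3</count>
    <policy_evaluated><disposition>reject</disposition><dkim>fail</dkim><spf>fail</spf></policy_evaluated>
  </row></record>
</feedback>"""
NAMESPACED = CLASSIC.replace(
    "<feedback>", '<feedback xmlns="urn:ietf:params:xml:ns:dmarc-2.0">'
)

print(parse_records(CLASSIC) == parse_records(NAMESPACED))
```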

Query examples once populated:
```sql
-- who sends as us, and are they aligning? (the payoff of the DKIM/subdomain fixes)
SELECT source_ip,
       sum(msg_count) AS total,
       -- coalesce: an IP with zero passing messages should show 0, not NULL
       coalesce(sum(msg_count) FILTER (WHERE dmarc_pass), 0) AS pass,
       round(100.0 * coalesce(sum(msg_count) FILTER (WHERE dmarc_pass), 0)
             / sum(msg_count)) AS pass_pct
FROM dmarc_record r
JOIN dmarc_report rep ON rep.id = r.report_id
WHERE rep.date_begin >= now() - interval '7 days'
GROUP BY source_ip
ORDER BY total DESC;
-- any UNKNOWN IP failing alignment = spoofing/forgotten relay (reputation poison)
```
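
The `--alert` thresholds described above (own IP below 95% pass, or an external IP with >=20 failing messages) can be sketched as a pure function over the per-IP totals such a query returns. This is an illustration, not the parser's actual code. The `OUR_IPS` set (here just the two stream IPs), the function name, and the tuple layout are assumptions.

```python
# Assumption: "our" IPs are the two bulk stream IPs; the real script may include more.
OUR_IPS = {"207.174.124.94", "207.174.124.107"}


def dmarc_alerts(per_ip, our_ips=OUR_IPS, min_pass_pct=95.0, spoof_min_fail=20):
    """per_ip maps source_ip -> (total_msgs, passing_msgs). Returns alert tuples."""
    alerts = []
    for ip, (total, passed) in sorted(per_ip.items()):
        if total == 0:
            continue
        failed = total - passed
        if ip in our_ips:
            pass_pct = 100.0 * passed / total
            if pass_pct < min_pass_pct:
                alerts.append(("own_ip_low_pass", ip, pass_pct))
        elif failed >= spoof_min_fail:
            alerts.append(("external_failing", ip, failed))
    return alerts


# Synthetic 7-day totals for illustration only.
week = {
    "207.174.124.94": (1000, 940),
    "207.174.124.107": (500, 500),
    "203.0.113.9": (25, 0),
    "198.51.100.7": (5, 0),
}
print(dmarc_alerts(week))
```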

---

## Ongoing hygiene (reduce reputation damage)

- **Dead-address scrub:** ~110 genuine `5.1.1 user unknown` bounces/day. listmonk
  already blocklists hard bounces after 1 (`bounce.actions hard->blocklist`), so
  these self-clean, but pre-scrubbing the dirtiest segments before send avoids the
  reputation hit. See `data/` segment exports.
- **Consumer-domain exclusion (two layers).** The authoritative list lives in
  `scripts/_email_exclusions.py` (`BLOCKED_EMAIL_DOMAINS`): gmail/google, the full
  Yahoo/Verizon-Media family, Microsoft consumer, **Apple/iCloud (added 2026-06-19)**,
  dead/legacy ISPs, and the legal do-not-contact list.
  1. *NEW selections:* the per-vertical builders filter it out of audience SQL and
     `listmonk_import.py` refuses to import a blocked address.
  2. *Already-imported subs:* LIST-BASED campaigns (FCC Direct Contacts list 3,
     CRTC/USF blasts) can still hit consumer subs imported BEFORE a domain joined
     the list. `scripts/scrub_listmonk_consumer.py` reconciles the live subscriber
     table against the exclusion list and blocklists any ENABLED match (idempotent;
     `--dry-run` supported; both `listmonk` + `listmonk_hc`). Runs daily 06:30 UTC
     via `/etc/cron.d/pw-listmonk-scrub` (tracked at `infra/cron/pw-listmonk-scrub`).
     First run 2026-06-19 blocklisted **7,943** trucking + **21** HC stale consumer
     subs (1,321 iCloud, 267 gmail, etc.) that were leaking via the running CRTC
     campaign. Re-run the scrub whenever you add a domain to the exclusion list.
- **Don't re-expand IPs** until Postmaster Tools shows recovered reputation.
- **Volume discipline:** keep the global 200/hr sliding window until reputation is
  green; concentrated low volume on one warm IP beats bursts.
- **Watch the rejection mix:** `5.7.1 reputation/spam/blocked` should fall over the
  next 1-2 weeks as the single-IP reputation builds. Track via (`-F` makes grep match
  the literal `5.7.1` instead of treating the dots as wildcards):
  `ssh ... 'sudo grep status=bounced /var/log/mail.log | grep -cF 5.7.1'`
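
To watch the whole rejection mix rather than a single code, a sketch like the following tallies bounced lines by their `dsn=` field. The function name and the sample lines are assumptions. Unlike the grep above, it keys on the `dsn=` field only, not on codes that appear in the remote server's response text.

```python
import re
from collections import Counter

DSN_RE = re.compile(r"\bdsn=(\d\.\d+\.\d+)\b")


def bounce_mix(log_lines) -> Counter:
    """Count status=bounced log lines per DSN code."""
    counts = Counter()
    for line in log_lines:
        if "status=bounced" not in line:
            continue
        m = DSN_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Synthetic log lines for illustration only.
sample_lines = [
    "postfix/smtp[1]: AAA: to=<a@example.com>, delay=1, dsn=5.7.1, status=bounced (550 5.7.1 blocked)",
    "postfix/smtp[2]: BBB: to=<b@example.com>, delay=2, dsn=5.7.1, status=bounced (550 5.7.1 blocked)",
    "postfix/smtp[3]: CCC: to=<c@example.com>, delay=1, dsn=5.1.1, status=bounced (550 5.1.1 user unknown)",
    "postfix/smtp[4]: DDD: to=<d@example.com>, delay=9, dsn=4.4.4, status=deferred (451 4.4.4 try later)",
]
print(bounce_mix(sample_lines).most_common())
```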