Commit graph

16 commits

Author SHA1 Message Date
justin
8e5590b492 mail: DMARC aggregate-report parser + dedicated dmarc@ mailbox ingestion
Tool 2 of the deliverability monitoring pair (Tool 1 = mail_reputation_monitor).
DMARC rua reports from dozens of operators (Google, Yahoo, Comcast, Cox, Bell,
Mimecast, Cisco ESA, GMX, mail.com, ...) were landing in ops@ (dmarc@ was a DL),
burying real mail and never parsed. Now ingested + queryable:

- dmarc@performancewest.net converted DL -> dedicated Carbonio mailbox; isolated
  IMAP creds in server .env, surfaced to workers in docker-compose.yml (mirrors
  OPS_IMAP_*). 29 historical reports moved ops@ -> dmarc@ via IMAP.
- scripts/dmarc_report_parser.py: IMAP fetch unseen -> decompress .gz/.zip/.xml
  (namespace-agnostic: classic + urn:ietf:params:xml:ns:dmarc-2.0 GMX/mail.com) ->
  parse aggregate XML -> upsert dmarc_report (keyed (org_name,report_id), no-op on
  re-parse) + dmarc_record per source IP. dmarc_pass = dkim_aligned OR spf_aligned.
  Marks \Seen. --dry-run/--all/--alert (7d per-IP summary + Telegram if one of OUR
  IPs <95% pass, or EXTERNAL IP sends >=20 failing msgs as us = spoofing under
  p=reject). psycopg2 imported lazily so --dry-run runs without the driver.
- api/migrations/102_dmarc_aggregate.sql: dmarc_report + dmarc_record tables.
- infra/cron/pw-dmarc-parser: 06:20 UTC daily --alert (after reputation, before scrub).
- docs/deliverability.md: DMARC section DONE; query examples.

Verified: dry-run --all parses all 28 reports (1 non-report test probe), 0 unknown
after the namespace fix.
2026-06-19 08:50:20 -05:00
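Illustration for 8e5590b492: a minimal sketch of the namespace-agnostic parse and the dmarc_pass rule, not the contents of scripts/dmarc_report_parser.py. The helper names (decompress, local, first, text_of, parse_report) are hypothetical, and reading the aligned results from policy_evaluated is an assumption about where the parser looks.

```python
import gzip
import io
import zipfile
import xml.etree.ElementTree as ET


def decompress(payload: bytes) -> bytes:
    """Return raw XML bytes from a .gz, .zip, or plain .xml attachment."""
    if payload[:2] == b"\x1f\x8b":        # gzip magic bytes
        return gzip.decompress(payload)
    if payload[:4] == b"PK\x03\x04":      # zip local-file header
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            return zf.read(zf.namelist()[0])
    return payload


def local(tag: str) -> str:
    """Drop any '{namespace}' prefix so both report dialects look alike."""
    return tag.rsplit("}", 1)[-1]


def first(node, name: str):
    """First element at or below node whose local tag name matches."""
    return next((el for el in node.iter() if local(el.tag) == name), None)


def text_of(node, name: str) -> str:
    el = first(node, name) if node is not None else None
    return (el.text or "").strip() if el is not None else ""


def parse_report(payload: bytes) -> dict:
    root = ET.fromstring(decompress(payload))
    records = []
    for rec in (el for el in root.iter() if local(el.tag) == "record"):
        # Assumption: alignment results are taken from <policy_evaluated>.
        evaluated = first(rec, "policy_evaluated")
        dkim_aligned = text_of(evaluated, "dkim") == "pass"
        spf_aligned = text_of(evaluated, "spf") == "pass"
        records.append({
            "source_ip": text_of(rec, "source_ip"),
            "count": int(text_of(rec, "count") or 0),
            "dmarc_pass": dkim_aligned or spf_aligned,
        })
    return {
        "org_name": text_of(root, "org_name"),
        "report_id": text_of(root, "report_id"),
        "records": records,
    }
```

Matching on the local tag name rather than a fixed namespace is one way to let the classic and urn:ietf:params:xml:ns:dmarc-2.0 dialects share a single code path.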
justin
b45332b5f7 infra(cron): nightly mail-reputation snapshot (pw-mail-reputation)
Runs mail_reputation_monitor --alert at 06:10 UTC, piping the day's postfix log
(sudo cat, same pattern as pw-warmup-tg-alert) into the DB-connected workers
container. Builds the daily SNDS-equivalent reputation trend and Telegram-alerts
on operator regressions. Installed to /etc/cron.d/pw-mail-reputation.
2026-06-19 08:38:35 -05:00
justin
72c69a05c9 infra(cron): daily Listmonk consumer-domain reconciliation (pw-listmonk-scrub)
Runs scrub_listmonk_consumer against both listmonk and listmonk_hc at 06:30 UTC,
before the campaign builders, so any ENABLED subscriber matching the authoritative
exclusion list is blocklisted retroactively. Keeps list-based campaigns (FCC
Direct Contacts, CRTC/USF, etc.) from leaking onto consumer mailboxes after a new
domain (e.g. Apple/iCloud) is added to the exclusion list. Installed to
/etc/cron.d/pw-listmonk-scrub on the host.
2026-06-19 00:00:46 -05:00
justin
899b880e7f trucking: weekly FMCSA source refresh so new non-compliant carriers are caught
The FMCSA census was a one-time snapshot (last loaded ~May 30) with NO refresh
timer -- carriers newly falling out of MCS-150/UCR compliance were never picked
up. New scripts/workers/fmcsa_source_refresh.py orchestrates the full pipeline
(census download -> enrichment -> deficiency flag -> verify new emails ->
MX-tag new) and runs weekly via cron pw-fmcsa-refresh (Sun 09:00 UTC), codified
in the mail-pipeline Ansible role.

Idempotent + incremental: the census upsert preserves email_verified /
listmonk_sent_at / deficiency_flags, so existing carriers keep their send state
and only census fields refresh; new DOTs flow into verification then campaigns.
A carrier who refiled gets a fresh mcs150_parsed, so the builder's overdue
WHERE clause stops targeting them automatically. Verify is capped per run
(20k) so it never stalls on millions of rows.

(Healthcare already auto-catches newly-revalidation-overdue providers within
its 63k institutional pool via pw-hc-refresh Mon/Wed/Fri.)
2026-06-17 20:44:54 -05:00
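Illustration for 899b880e7f: a minimal sketch of a census upsert that refreshes census fields while leaving send state alone. It uses an in-memory SQLite database as a stand-in; the table name carrier, the columns dot_number and legal_name, and the functions make_db and load_census are hypothetical, and the real schema and database may differ.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE carrier (
            dot_number       INTEGER PRIMARY KEY,
            legal_name       TEXT,
            mcs150_parsed    TEXT,
            email_verified   INTEGER DEFAULT 0,
            listmonk_sent_at TEXT,
            deficiency_flags TEXT
        )
        """
    )
    return conn


# Only census columns appear in the UPDATE list, so email_verified,
# listmonk_sent_at, and deficiency_flags survive every reload untouched.
UPSERT = """
INSERT INTO carrier (dot_number, legal_name, mcs150_parsed)
VALUES (?, ?, ?)
ON CONFLICT(dot_number) DO UPDATE SET
    legal_name    = excluded.legal_name,
    mcs150_parsed = excluded.mcs150_parsed
"""


def load_census(conn: sqlite3.Connection, rows) -> None:
    """Insert new DOTs and refresh census fields for existing ones."""
    conn.executemany(UPSERT, rows)
    conn.commit()
```

Reloading the same census twice leaves the table unchanged, and a carrier who refiled simply gets a newer mcs150_parsed on the next load.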
justin
4dc5690666 infra: codify the email-campaign pipeline in Ansible (new mail-pipeline role)
The entire outbound campaign pipeline lived ONLY on the host and was never in
IaC -- a fresh rebuild would have silently shipped NO campaigns, NO IP warmup/
ramp, and NO bounce processing. New mail-pipeline role + deploy-mail-pipeline.yml
playbook deploy it from the canonical repo copies:

  cron.d (infra/cron/):
    - pw-trucking-campaign-builder, pw-ifta-campaign, pw-ucr-campaign
    - pw-hc-campaign, pw-hc-nppes, pw-hc-refresh
    - pw-mta-warmup, pw-listmonk-rampcap, pw-hc-rampcap
    - pw-ip-rehab, pw-warmup-tg-alert
  helper scripts (-> /usr/local/bin):
    - pw-mta-warmup, pw-listmonk-rampcap, pw-hc-rampcap, pw-warmup-tg-alert
    - postfix-bounce-notify.sh, postfix-hc-bounce-notify.sh, listmonk-bounce-sync.py
  systemd services:
    - pw-bounce-watcher.service (was missing from repo), pw-hc-bounce-watcher.service

Also creates the deploy-owned {{project_dir}}/logs dir (deploy can't write
/var/log, so a missing dir made cron redirects fail). Added the 6 cron.d files
that existed only on the host, the trucking bounce-watcher unit, and synced
infra/cron/pw-hc-refresh to the live version (revalidation download + enrich
steps). Role wired into site.yml after the mail (OpenDKIM) role.

Part of the email-deliverability incident hardening.
2026-06-17 20:26:01 -05:00
justin
2caab6aa69 hc: warmup must run DAILY for the full 21-day ramp (not weekdays-only)
The HC warmup crons were '* * 1-5' (Mon-Fri), silently skipping weekends -- but a
proper warmup needs CONTINUOUS daily volume for 21 days (mailbox providers reward
consistency; gaps stall reputation). The Jun 14 'HC 0 sent' alert was just a
skipped Sunday, but the weekend skips also broke ramp continuity.

- pw-hc-campaign + pw-hc-nppes: '* * 1-5' -> '* * *' (daily), vendored + applied live.
- Re-aligned the warmup start stamp from calendar-day 9 to send-day 5 so the
  volume ramp matches reputation actually built (it had skipped ~4 weekend days,
  running the ramp ahead of real history).
- Fixed the stale 'Mon-Fri only' comment in daily_slice().
- Vendored nppes cron now carries the enriched-CSV + 4-segment config.
2026-06-14 21:02:08 -05:00
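Illustration for 2caab6aa69: a small sketch of why calendar-day and send-day drift apart under a Mon-Fri schedule. The function names are hypothetical, and the 2026-06-06 start date in the usage note is an assumption chosen only because it reproduces the figures in the message.

```python
from datetime import date, timedelta


def calendar_day(start: date, today: date) -> int:
    """1-based count of calendar days since the warmup started."""
    return (today - start).days + 1


def send_day(start: date, today: date, weekdays_only: bool) -> int:
    """Number of days in [start, today] on which the cron actually fired."""
    fired = 0
    day = start
    while day <= today:
        # date.weekday(): Monday is 0, Sunday is 6.
        if not weekdays_only or day.weekday() < 5:
            fired += 1
        day += timedelta(days=1)
    return fired
```

With an assumed start of 2026-06-06 and today = 2026-06-14, calendar_day gives 9 while send_day with weekdays_only=True gives 5, the 4-day gap being the skipped weekend days. Under the daily schedule the two counts stay equal.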
justin
ff4ab262a8 hc: cron to feed NPPES institutional base (63k verified) into warmup, MX-throttled
Adds /etc/cron.d/pw-hc-nppes (weekdays 07:30) that imports the verified NPPES
institutional general-compliance base into the OIG screening segment, throttled
per MX operator. Separate from the 07:00 reval-segment run so the two pipelines
stay independent. Vendored the cron file under infra/cron/.
2026-06-12 22:11:12 -05:00
justin
25f4a7503b warmup: IP rehab for .91-.93 so they can be reallocated
The 3 IPs (mta02-04 / .91-.93) retired after the May 30-31 over-volume blast are
NOT on any DNSBL (Spamhaus/Barracuda/SpamCop/SORBS all clean) and have clean PTRs
+ SPF/DKIM/DMARC -- the damage was provider-internal reputation, which recovers
with slow clean sending. scripts/ip_rehab.py sends a tiny ramping trickle
(10/IP/day -> cap 60) of genuine CAN-SPAM-compliant compliance check-in mail to
clean business-domain, never-bounced recipients via dedicated heavily-throttled
postfix transports rehab02/03/04 (30s/msg, bound to .91/.92/.93). Routing uses an
X-PW-Rehab-IP header + header_checks FILTER to override the transport_maps randmap
warmup rotation (verified: mail routes via rehab transports, status=sent). Daily
cron pw-ip-rehab. After ~2-3 weeks of clean sending the IPs can be reallocated.
2026-06-09 20:27:47 -05:00
justin
9fa2c86f01 fix(warmup): HC cron logged to /var/log (deploy can't write) -> cron silently died
The HC warmup builder ran from cron at 07:00 but the >> /var/log/pw-hc-campaign.log
redirect failed (deploy user cannot write /var/log), and a failed output redirect
makes cron's shell abort the command BEFORE it runs -> HC sent 0/day since the log file was
removed. Route HC cron logs to /opt/performancewest/logs/ (deploy-owned) so the
redirect always succeeds. Builder itself was fine (verified: imports + sends work,
0 bounces). Also removed the stale 'campaign-warmup.sh 122' root-cron line that
pointed at a finished campaign and a script that no longer existed.
2026-06-09 16:06:28 -05:00
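Illustration for 9fa2c86f01: a sketch that reproduces the failure mode with sh, the way a cron line would be run. It substitutes a missing directory for the unwritable /var/log so the effect shows up even when run as root; run_job and demo are hypothetical names.

```python
import os
import subprocess
import tempfile


def run_job(log_path: str, marker: str) -> int:
    """Run `touch marker >> log_path` through sh and return the exit code."""
    cmd = f'touch "{marker}" >> "{log_path}" 2>&1'
    return subprocess.run(["sh", "-c", cmd], stderr=subprocess.DEVNULL).returncode


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        logs = os.path.join(tmp, "logs")
        os.mkdir(logs)
        ok_marker = os.path.join(tmp, "ok")
        bad_marker = os.path.join(tmp, "bad")
        # Redirect target exists: the command runs and creates its marker.
        ok_rc = run_job(os.path.join(logs, "job.log"), ok_marker)
        # Redirect target cannot be opened: sh never starts the command.
        bad_rc = run_job(os.path.join(tmp, "no-such-dir", "job.log"), bad_marker)
        return {
            "ok_ran": os.path.exists(ok_marker),
            "ok_rc": ok_rc,
            "bad_ran": os.path.exists(bad_marker),
            "bad_rc": bad_rc,
        }
```

The second job exits non-zero without ever creating its marker file, which is the same silent no-op the HC builder hit.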
justin
7c39a858cc monitoring: daily warmup IP-reputation Telegram alert
End-of-day (20:00 Central) check of campaign deliverability across both sending
pools (main out05-09 + healthcare hcout). Sends a Telegram alert ONLY when there
is a reputation problem -- delivery below 65% or a spam/policy-block (550-5.7.1)
spike above 150/day -- so healthy days stay silent. Reuses the existing
TELEGRAM_BOT_TOKEN/CHAT_ID from /opt/performancewest/.env. Logs every run to
/var/log/pw-warmup-healthcheck.log for history. Excludes internal/probe noise so
the delivery figure reflects real external recipients.
2026-06-08 21:06:41 -05:00
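Illustration for 7c39a858cc: a minimal sketch of the alert decision using the two thresholds from the message (delivery below 65%, more than 150 policy blocks per day). The function name, argument names, and reason strings are hypothetical; how the real check counts log lines is not shown here.

```python
def reputation_problems(
    delivered: int,
    attempted: int,
    policy_blocks: int,
    min_delivery: float = 0.65,
    max_policy_blocks: int = 150,
) -> list[str]:
    """Return the reasons to alert; an empty list means stay silent."""
    reasons = []
    # Skip the ratio on a day with no external sends to avoid dividing by zero.
    if attempted > 0 and delivered / attempted < min_delivery:
        reasons.append("low_delivery")
    if policy_blocks > max_policy_blocks:
        reasons.append("policy_block_spike")
    return reasons
```

A Telegram message would be sent only when the returned list is non-empty, which keeps healthy days silent.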
justin
2156a5e05f hc refresh: run Mon/Wed/Fri instead of weekly to shrink CMS data-lag
The 'already revalidated' replies come from the CMS data-lag window (a provider
completes their revalidation but CMS's public Due Date List still shows them
overdue for weeks). Running the refresh 3x/week instead of weekly shrinks that
window from up to 7 days to ~2-3, so a provider who just completed stops being
targeted faster. No change to the overdue window or audience size -- this is the
lever that reduces stale-data complaints without losing prospects.
2026-06-08 10:53:36 -05:00
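Illustration for 2156a5e05f: a short sketch of the worst-case gap between refresh runs for a given weekly schedule, which bounds how long stale CMS data can linger on the refresh side. The function name is hypothetical, and weekdays are numbered with Monday as 0.

```python
def max_gap_days(run_weekdays: set[int]) -> int:
    """Longest wait, in days, between consecutive runs of a weekly schedule."""
    days = sorted(run_weekdays)
    gaps = [later - earlier for earlier, later in zip(days, days[1:])]
    # Wrap around from the last run of one week to the first run of the next.
    gaps.append(days[0] + 7 - days[-1])
    return max(gaps)
```

A Monday-only schedule gives max_gap_days({0}) == 7, while Mon/Wed/Fri gives max_gap_days({0, 2, 4}) == 3, with the Friday-to-Monday stretch being the longest.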
justin
9cb10b18e0 feat(hc): deliverability prune -- evict newly-Google-hosted subscribers
Belt-and-suspenders for the edge case you flagged: a domain already in a warmup list
could flip its MX to Google Workspace between weekly refreshes, after which it
would hard-bounce from the cold IP. The import-time guard only catches NEW adds.

- prune_holdouts(): enumerates each warmup list's subscribers, matches them
  against the FRESH master CSV (re-classified weekly), and removes any whose
  domain is now Google-hosted. DELIVERABILITY-ONLY -- it never evicts for
  audience reasons (an overdue provider drifting out of the 1-90 day window was
  a valid target when warmed; re-litigating that just wastes warmup progress).
- --prune (run alongside warming) and --prune-only (prune then exit).
- Wired into the weekly refresh cron as a --prune-only chained step, so MX is
  re-checked and holdouts removed every Monday before the weekday sends.

Verified end-to-end: with no Google domains in lists it's a 0-op; injecting a
simulated Google-flipped domain into the master, the prune correctly detects and
(in a real run) would remove it from every list it's on.
2026-06-08 03:39:56 -05:00
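Illustration for 9cb10b18e0: a minimal sketch of the deliverability-only match, not the actual prune_holdouts(). The function find_holdouts and both input shapes (a mapping of list name to subscriber emails, and a set of domains the fresh master CSV classifies as Google-hosted) are assumptions.

```python
def find_holdouts(
    list_members: dict[str, list[str]],
    google_domains: set[str],
) -> dict[str, list[str]]:
    """Map each warmup list to the subscribers whose domain is now Google-hosted."""
    holdouts: dict[str, list[str]] = {}
    for list_name, emails in list_members.items():
        hits = [
            email
            for email in emails
            if email.rsplit("@", 1)[-1].lower() in google_domains
        ]
        if hits:
            holdouts[list_name] = hits
    return holdouts
```

Because the only input besides list membership is the hosting classification, nothing here can evict a subscriber for audience reasons. With no Google domains present the result is an empty mapping, matching the 0-op case in the message.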
justin
feb677f6ce fix(hc warmup): only mail slightly-overdue providers (deliverability)
Mailing heavily-overdue NPIs (months/years past due) risks hitting practices
that have closed, merged, or abandoned the inbox -> hard bounces, which are the
fastest way to wreck a warming IP's reputation. The warmup now restricts the
reval_overdue selector to an inclusive [HC_OVERDUE_MIN, HC_OVERDUE_MAX] window
(default 1-90 days) and the OIG 'any' selector likewise excludes heavily-overdue
and dropped-off-list rows. On the current cohort this trims the overdue audience
178->96 and the OIG audience 399->317, holding out the stale long tail
(181-365d + 366d+). upcoming/active providers are unaffected.
2026-06-08 03:27:22 -05:00
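Illustration for feb677f6ce: a small sketch of the inclusive overdue window. HC_OVERDUE_MIN and HC_OVERDUE_MAX and their 1-90 defaults come from the message; modelling them as module constants, and the names in_overdue_window and select_overdue, are assumptions.

```python
HC_OVERDUE_MIN = 1   # default lower bound, in days overdue
HC_OVERDUE_MAX = 90  # default upper bound, in days overdue


def in_overdue_window(
    days_overdue: int,
    lo: int = HC_OVERDUE_MIN,
    hi: int = HC_OVERDUE_MAX,
) -> bool:
    """True when the provider is overdue by lo..hi days, both ends inclusive."""
    return lo <= days_overdue <= hi


def select_overdue(rows: list[dict]) -> list[dict]:
    """Keep only rows inside the window; the stale long tail is held out."""
    return [row for row in rows if in_overdue_window(row["days_overdue"])]
```

A provider 1 or 90 days overdue is kept, while one at 0 or 91 days is held out.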
justin
167c4a3847 infra/cron: multi-segment hc warmup + weekly data-refresh cron
Tracks the deployed cron.d files in the repo:
- pw-hc-campaign: updated comment to reflect the now multi-segment warmup
  (revalidation + OIG + NPPES + reactivation + bundle); command unchanged.
- pw-hc-refresh (NEW): Mon 06:00 Central weekly data refresh, ~1h before the
  07:00 weekday send, so every send uses fresh CMS/OIG status.
2026-06-08 03:15:47 -05:00
justin
95698852ce healthcare warmup: gate Google/Workspace domains out of week 1 (they hard-reject cold IPs 550-5.7.1); send 501 non-Google practice domains first, defer 222 Google to week 2-3; cron uses hc_warmup_nongoogle.csv
2026-06-06 04:02:00 -05:00
justin
2bc86268f7 healthcare: HC warmup campaign cron (Mon-Fri 7AM Central) - imports overdue-first verified slice into listmonk-hc + runs Medicare-revalidation campaign via hc HOT stream; rate-throttled by pw-hc-rampcap
2026-06-06 03:57:08 -05:00