Tool 2 of the deliverability monitoring pair (Tool 1 = mail_reputation_monitor).
DMARC rua reports from dozens of operators (Google, Yahoo, Comcast, Cox, Bell,
Mimecast, Cisco ESA, GMX, mail.com, ...) were landing in ops@ (dmarc@ was a DL),
burying real mail and never parsed. Now ingested + queryable:
- dmarc@performancewest.net converted DL -> dedicated Carbonio mailbox; isolated
IMAP creds in server .env, surfaced to workers in docker-compose.yml (mirrors
OPS_IMAP_*). 29 historical reports moved ops@ -> dmarc@ via IMAP.
- scripts/dmarc_report_parser.py: IMAP fetch unseen -> decompress .gz/.zip/.xml
(namespace-agnostic: classic + urn:ietf:params:xml:ns:dmarc-2.0 GMX/mail.com) ->
parse aggregate XML -> upsert dmarc_report (keyed (org_name,report_id), no-op on
re-parse) + dmarc_record per source IP. dmarc_pass = dkim_aligned OR spf_aligned.
Marks \Seen. --dry-run/--all/--alert (7d per-IP summary + Telegram if one of OUR
IPs <95% pass, or EXTERNAL IP sends >=20 failing msgs as us = spoofing under
p=reject). psycopg2 imported lazily so --dry-run runs without the driver.
- api/migrations/102_dmarc_aggregate.sql: dmarc_report + dmarc_record tables.
- infra/cron/pw-dmarc-parser: 06:20 UTC daily --alert (after reputation, before scrub).
- docs/deliverability.md: DMARC section DONE; query examples.
Verified: dry-run --all parses all 28 reports (1 non-report test probe), 0 unknown
after the namespace fix.
Two deliverability hardening fixes from the email audit:
1. Plaintext (altbody): all campaigns were HTML-only. Listmonk only emits
multipart/alternative when altbody is set, and HTML-only bulk mail is a
spam-score signal. New scripts/_email_plaintext.py renders a readable
text/plain part from the HTML body (dependency-free; preserves Listmonk
{{ .Subscriber }}/{{ UnsubscribeURL }} template tags, turns links into
'text (url)'). Wired into the trucking builder (and thus UCR + IFTA, which
reuse create_and_schedule_campaign) and the healthcare builder.
2. Stable container hostname: Listmonk derived its Message-ID from the random
docker container id -> @localhost.localdomain (spam-score signal). Pin both
listmonk + listmonk-hc hostname to perfwest.performancewest.net, matching
Listmonk's SMTP hello_hostname.
Part of the email-deliverability incident hardening.
Root cause of recurring 'Password not found for Email Account Performance West
Outgoing': the account was shipped as a fixture with awaiting_password=1 and no
password. Email Account SMTP passwords are encrypted per-site and cannot live in
a fixture, so every `bench migrate` reimported the fixture and re-broke
outgoing mail (login notifications, password resets, welcome emails).
- Remove the Email Account fixture (it cannot carry the encrypted secret).
- Add email_account_sync.sync_outgoing_password: idempotent, exception-safe
upsert that reconciles the account + password from SMTP_* env and clears
awaiting_password.
- Wire it to after_migrate (repairs at end of every deploy/migrate, right after
fixtures import) and the daily scheduler (heals out-of-band restore/restart
drift).
- Pass SMTP_* into the erpnext + erpnext-scheduler containers so the sync has
the secret (they previously had no SMTP env).
Isolated from the trucking listmonk: own DB (listmonk_hc), own uploads volume,
own sliding-window cap. Configured (on dev) with 3 SMTP servers pointing at the
host Postfix hc submission ports 2526/2527/2528 so it round-robins the dedicated
hc IPs .107/.108/.109. Reaches the host MTA via the docker bridge gateway.
Note: the listmonk image needs an explicit one-time '--install --idempotent
--yes' against listmonk_hc (env vars alone do not auto-install this image tag).
Validated on dev: listmonk-hc container (172.19.0.16) -> host :2526 (hcsubmit107)
-> hcout1 (.107) -> real gmail MX; both listmonk instances Up.
Chromium rejects authenticated SOCKS5 ('Browser does not support socks5 proxy
authentication'). Add a gost (ginuerzh/gost:2.11.5) 'proxy-relay' sidecar that
listens unauthenticated on socks5://proxy-relay:11080 and forwards to the
authenticated residential upstream (HEALTHCARE_PROXY_UPSTREAM_URL). Workers point
Playwright at the relay via HEALTHCARE_PROXY_URL=socks5://proxy-relay:11080.
env template: split into HEALTHCARE_PROXY_UPSTREAM_URL (authenticated, password
percent-encoded so '#' -> %23) and HEALTHCARE_PROXY_URL (the relay address).
Validated end-to-end on dev: workers Chromium -> proxy-relay -> residential
egress IP 76.228.206.147; NPPES + PECOS both HTTP 200.
CMS healthcare portals (NPPES, PECOS, I&A) block datacenter IPs, so the
healthcare browser automation needs to egress via the residential proxy on
hg409y7ez04.sn.mynetname.net (username 'performancewest').
- undetected_browser: use_proxy now accepts an env-var name, so callers can
select a domain-specific proxy. _proxy_config(proxy_env) reads it and falls
back to UNDETECTED_PROXY_URL. Healthcare uses 'HEALTHCARE_PROXY_URL'.
- probe_npi_undetected: launches with use_proxy='HEALTHCARE_PROXY_URL' when set.
- npi_provider: documents that the (future) automated NPPES/PECOS flows must
use the healthcare proxy.
- Plumb HEALTHCARE_PROXY_URL (+ UNDETECTED_PROXY_URL fallback) through the
ansible env template and docker-compose workers env.
The credential itself is NOT in the repo. Set the full URL in the ansible
vault as vault_healthcare_proxy_url:
socks5://performancewest:<password>@hg409y7ez04.sn.mynetname.net:<port>
Verified parsing + Playwright proxy-dict wiring with a unit test.
The erpnext service was missing both env vars that the portal needs:
- CUSTOMER_JWT_SECRET: verifies /set-password magic-link tokens signed by the
API. Without it, the set-password page resolved an empty/placeholder secret
and showed 'Link invalid' for every customer onboarding link.
- DATABASE_URL: lets www/orders.py read compliance_orders from Postgres for the
portal's Compliance section.
Both were present on api/workers but never wired to erpnext -> drift. Now the
single ERPNext portal can actually verify invites and show compliance orders.
Sends to the monitoring bot immediately when payment is confirmed:
- Customer name and email
- Service/slug ordered
- Total amount (includes all fees: service + formation + state + addons)
- Payment method
- Order number and type
Fire-and-forget — never blocks the payment flow.
Requires TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID env vars on API container.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dashboard queries now use max() to pick UP value when old stale
probe targets coexist with new ones. Prometheus admin API enabled
for future TSDB cleanup of stale series.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ERPNext: custom blackbox module with Host: performancewest.net header
(ERPNext multitenancy requires site name in Host for routing)
- Forgejo: add extra_hosts to blackbox-exporter so it can resolve
host.docker.internal to reach forgejo on port 3000
- Blackbox http_erpnext module: sets Host header, expects 200
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
host network mode prevented Prometheus from reaching the exporter.
Switched back to bridge with extra_hosts + explicit port mapping.
Added timeout flag to prevent hanging on stub_status fetch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nginx-exporter couldn't reach host nginx via host.docker.internal
(connection timeout). Switch to network_mode: host so it can access
127.0.0.1:8888 directly. Prometheus scrapes via host.docker.internal
with extra_hosts mapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- nginx stub_status moved to port 8888 (port 80 was being caught
by other server blocks and returning 301)
- nginx-exporter updated to scrape :8888
- Added alertmanager scrape job to Prometheus config (was missing,
so alertmanager dashboard had no data)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>