The HC warmup builder ran from cron at 07:00 but the >> /var/log/pw-hc-campaign.log
redirect failed (deploy user cannot write /var/log), and a failed output redirect
makes cron abort the command BEFORE it runs -> HC sent 0/day since the log file was
removed. Route HC cron logs to /opt/performancewest/logs/ (deploy-owned) so the
redirect always succeeds. Builder itself was fine (verified: imports + sends work,
0 bounces). Also removed the stale 'campaign-warmup.sh 122' root-cron line that
pointed at a finished campaign + no longer existed.
crypto.performancewest.net kept going down because the shkeeper-deployment web
pod periodically HANGS (HTTP server deadlocks while the apscheduler background
thread keeps the process alive). The helm chart (shkeeper-1.7.15) ships NO
liveness or readiness probe, so k8s saw the hung pod as Running and never
restarted it, and kept routing traffic to the dead backend -> site down until a
manual restart.
Added HTTP probes on / :5000 (302 = healthy): liveness auto-restarts a hung pod,
readiness pulls it from the Service endpoints. Applied live via kubectl patch
(chart does not expose probes via values; re-apply after any helm upgrade --
command in the file header). Verified: new pod comes up READY 1/1 (probe passes)
and crypto.performancewest.net serves 302 again.
End-of-day (20:00 Central) check of campaign deliverability across both sending
pools (main out05-09 + healthcare hcout). Sends a Telegram alert ONLY when there
is a reputation problem -- delivery below 65% or a spam/policy-block (550-5.7.1)
spike above 150/day -- so healthy days stay silent. Reuses the existing
TELEGRAM_BOT_TOKEN/CHAT_ID from /opt/performancewest/.env. Logs every run to
/var/log/pw-warmup-healthcheck.log for history. Excludes internal/probe noise so
the delivery figure reflects real external recipients.
The 'already revalidated' replies come from the CMS data-lag window (a provider
completes their revalidation but CMS's public Due Date List still shows them
overdue for weeks). Running the refresh 3x/week instead of weekly shrinks that
window from up to 7 days to ~2-3, so a provider who just completed stops being
targeted faster. No change to the overdue window or audience size -- this is the
lever that reduces stale-data complaints without losing prospects.
Belt-and-suspenders for the edge you flagged: a domain already in a warmup list
could flip its MX to Google Workspace between weekly refreshes, after which it
would hard-bounce from the cold IP. The import-time guard only catches NEW adds.
- prune_holdouts(): enumerates each warmup list's subscribers, matches them
against the FRESH master CSV (re-classified weekly), and removes any whose
domain is now Google-hosted. DELIVERABILITY-ONLY -- it never evicts for
audience reasons (an overdue provider drifting out of the 1-90 day window was
a valid target when warmed; re-litigating that just wastes warmup progress).
- --prune (run alongside warming) and --prune-only (prune then exit).
- Wired into the weekly refresh cron as a --prune-only chained step, so MX is
re-checked and holdouts removed every Monday before the weekday sends.
Verified end-to-end: with no Google domains in lists it's a 0-op; injecting a
simulated Google-flipped domain into the master, the prune correctly detects and
(in a real run) would remove it from every list it's on.
Mailing heavily-overdue NPIs (months/years past due) risks hitting practices
that have closed, merged, or abandoned the inbox -> hard bounces, which are the
fastest way to wreck a warming IP's reputation. The warmup now restricts the
reval_overdue selector to an inclusive [HC_OVERDUE_MIN, HC_OVERDUE_MAX] window
(default 1-90 days) and the OIG 'any' selector likewise excludes heavily-overdue
and dropped-off-list rows. On the current cohort this trims the overdue audience
178->96 and the OIG audience 399->317, holding out the stale long tail
(181-365d + 366d+). upcoming/active providers are unaffected.
Tracks the deployed cron.d files in the repo:
- pw-hc-campaign: updated comment to reflect the now multi-segment warmup
(revalidation + OIG + NPPES + reactivation + bundle); command unchanged.
- pw-hc-refresh (NEW): Mon 06:00 Central weekly data refresh, ~1h before the
07:00 weekday send, so every send uses fresh CMS/OIG status.
Standard (no-login) CMS filings are mailed in one Priority Mail envelope per
destination agency, batched each postal working-day morning to save postage.
- migration 089: paper_filing_batches table + esign_records.paper_batch_id /
filing_destination_key (idempotent: a filing is batched at most once).
- batch_cover_sheet.py: per-agency cover sheet (sender/dest/date/manifest) +
merged print-job PDF (cover + all enclosed signed filings).
- daily_paper_batch.py worker: gather signed+unbatched cms855/cms10114 filings,
group by destination (MAC by state via mac_routing; Fargo for CMS-10114),
build cover+merged PDF per agency, persist batch, mark filings batched.
Self-gates on postal working days (skips weekends + federal/USPS holidays).
Phase 1 = human prints+mails; phase 2 = wire print-mail API.
- worker-crons: pw-paper-batch systemd timer (Mon-Fri 13:30 UTC, self-gated).
- test_paper_batch.py: 15/15 pass (working-day gating, routing, cover+merge).
- deploy.sh/deploy-dev.sh: bring up listmonk-hc (upstream image, excluded from
build); document the one-time listmonk_hc DB create + --install.
- docker-compose.dev.override.yml: dev-only override (committed) that drops the
prod host-port bindings and pins dev's own postgres volume (dev-pgdata) via
compose !override tags. deploy-dev ships it as docker-compose.override.yml so
syncing the canonical compose to the shared host no longer breaks dev's
api-postgres (port :5432 clash + volume switch). Discovered + fixed while
validating listmonk-hc on dev.
- pw-hc-rampcap.sh: healthcare analogue of pw-listmonk-rampcap, ramps the
listmonk_hc cap 100->1000/h off /etc/postfix/hc-warmup-start, fully
independent of the trucking ramp/cap.
Adds 3 hc submission ports (2526/2527/2528) in the single Postfix instance,
each content_filter'd onto a dedicated hc transport (hcout1/2/3) binding the
hc IPs .107/.108/.109 with hc HELO identity (hcmta01-03) and hotter concurrency.
listmonk-hc round-robins the 3 ports.
Discovered + documented the constraint that drove this shape: transport_maps
randmap is owned by the shared trivial-rewrite(8) and is global, so neither a
per-smtpd -o transport_maps nor a FILTER randmap:{...} can scope a separate IP
pool (FILTER parses randmap as a literal transport). content_filter=hcoutN:
(empty nexthop) overrides transport_maps and keeps the real recipient domain.
Verified end-to-end on the server: :2527 -> hcout2 (.108) -> real gmail MX;
trucking transport_maps (.94-.96) untouched. Idempotent, postfix-check gated
with auto-rollback.
Chromium rejects authenticated SOCKS5 ('Browser does not support socks5 proxy
authentication'). Add a gost (ginuerzh/gost:2.11.5) 'proxy-relay' sidecar that
listens unauthenticated on socks5://proxy-relay:11080 and forwards to the
authenticated residential upstream (HEALTHCARE_PROXY_UPSTREAM_URL). Workers point
Playwright at the relay via HEALTHCARE_PROXY_URL=socks5://proxy-relay:11080.
env template: split into HEALTHCARE_PROXY_UPSTREAM_URL (authenticated, password
percent-encoded so '#' -> %23) and HEALTHCARE_PROXY_URL (the relay address).
Validated end-to-end on dev: workers Chromium -> proxy-relay -> residential
egress IP 76.228.206.147; NPPES + PECOS both HTTP 200.
CMS healthcare portals (NPPES, PECOS, I&A) block datacenter IPs, so the
healthcare browser automation needs to egress via the residential proxy on
hg409y7ez04.sn.mynetname.net (username 'performancewest').
- undetected_browser: use_proxy now accepts an env-var name, so callers can
select a domain-specific proxy. _proxy_config(proxy_env) reads it and falls
back to UNDETECTED_PROXY_URL. Healthcare uses 'HEALTHCARE_PROXY_URL'.
- probe_npi_undetected: launches with use_proxy='HEALTHCARE_PROXY_URL' when set.
- npi_provider: documents that the (future) automated NPPES/PECOS flows must
use the healthcare proxy.
- Plumb HEALTHCARE_PROXY_URL (+ UNDETECTED_PROXY_URL fallback) through the
ansible env template and docker-compose workers env.
The credential itself is NOT in the repo. Set the full URL in the ansible
vault as vault_healthcare_proxy_url:
socks5://performancewest:<password>@hg409y7ez04.sn.mynetname.net:<port>
Verified parsing + Playwright proxy-dict wiring with a unit test.
Adds a systemd-timed worker that nudges customers who paid but never completed
their intake form (which stalls fulfillment).
- migration 087: intake_reminder_count + intake_reminder_last_at on
compliance_orders (makes the daily run idempotent and bounded), plus a
partial index for the paid-order eligibility scan.
- scripts/workers/intake_reminder.py: each run emails any paid order with
intake_data_validated != TRUE, capped at 10 reminders/order, at most one
consolidated email per customer per day (groups a customer's incomplete
services into one email). Reuses the post-payment intake URL format
(/order/{slug}?order={n}) and the API's email validation, skipping
placeholder/invalid addresses (synthetic@, pipeline.com, etc.). Sends via
smtplib with SMTP_PASS (verified working in the worker container).
- worker-crons: pw-intake-reminder timer, daily ~noon ET (16:00 UTC).
The pw-portal-tls.conf.j2 template was stale (basic 47-line version) while the
live /etc/nginx/sites-enabled/pw-portal.conf was hand-maintained with branding,
/assets/ and /files/ serving. A future ansible run would have clobbered the
working config. Sync the template to the live config (templatized) and document
why /files/ must be served from /opt/erpnext-assets, not the docker volume.
- build_trucking_campaigns.py: nightly script that creates 8 Listmonk campaigns
per day (4 TZ x 2 types: MCS-150 overdue 2k/TZ, inactive USDOT 1k/TZ)
at 4AM ET / 5AM ET (CT) / 6AM ET (MT) / 7AM ET (PT). Deduplicates via
listmonk_sent_at column.
- migration 083: add listmonk_sent_at + listmonk_campaign_type to fmcsa_carriers
- email_verifier.py: bump max_workers from 5 to 20 for 4x faster throughput
- cron: daily pw-trucking-campaigns at 08:00 UTC (3 AM EST)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- CRTC letter now auto-emailed to secretary.general@crtc.gc.ca after eSign
- BITS admin todo updated to reference electronic + physical submission
- COLIN selectors.py: documented verification status per step
- BC config: added CRTC Secretary General email address
- plan.md: marked completed items (eSign, portal auth, CRTC email)
- go-live-todo.md: marked Compliance Calendar DocType as imported
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When any Playwright submission fails (selector not found, timeout, etc.):
1. Full-page screenshot captured and uploaded to MinIO
2. Telegram alert sent immediately with error details + screenshot link
3. Email alert to ops with same info
4. Admin todo includes screenshot MinIO path for debugging
5. Client order stays pending for manual completion
Proactive selector health check (daily 7am CT cron):
- Navigates to each portal (FCC RMD, USAC E-File, FCC CPNI/ECFS)
- Verifies all critical selectors are still present in the DOM
- If selectors are missing (UI changed): alerts via Telegram + email
BEFORE any real client order fails
- Reports which service slugs are affected
Integrated into:
- RMD filing handler (fccprod.servicenowservices.com)
- Form 499-A handler (forms.universalservice.org)
- Form 499-Q handler (already had error handling)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After 499-A+Q bundle is filed, the handler now creates actual
compliance_orders for each remaining quarterly 499-Q filing:
Schedule: Q1 due Feb 1, Q2 due May 1, Q3 due Aug 1, Q4 due Nov 1
Each quarterly order:
- Created as paid (covered by bundle price)
- Has due_date, quarter, period_end_date in intake_data
- Links to parent 499-A order
- Tracks reminder status (30d/14d/7d sent flags)
Notification worker (quarterly_499q_notify.py):
- Runs daily at 8am CT via systemd timer
- Sends HTML reminder emails at 30, 14, 7 days before due
- Email includes intake link for client to submit quarterly data
- Late warning at 7 days: "USAC may estimate higher contributions"
- Idempotent: won't re-send same reminder level
Added fcc-499q service slug ($0, not sold standalone).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive security update automation:
1. Debian OS (unattended-upgrades) — tightened to security-only:
- Removed general Debian updates (prevents feature/breaking changes)
- Only Debian-Security origins auto-installed
- Email admin on every upgrade via ops@performancewest.net
- Auto-reboot at 4 AM if kernel update requires it
- needrestart auto-restarts services after library updates
2. Docker CE — major version guard:
- Patch updates within pinned major version auto-applied
- Major version jumps held + admin alerted for manual review
- docker-ce, docker-ce-cli, containerd.io all version-guarded
3. Container base images — daily at 3:30 AM:
- Pulls latest base images for all docker-compose services
- Compares image digests — only rebuilds if changed
- Restarts only affected services (not full stack)
- Alerts admin on rebuild failures requiring manual intervention
- Covers both prod and dev compose projects
4. k3s — weekly Sunday at 3:45 AM:
- Patch updates within current minor auto-applied
- Minor/major upgrades alert admin for manual review
- Verifies node Ready status after update
- Alerts on failures with investigation instructions
5. Admin notifications via SMTP:
- [INFO] for successful patches
- [WARNING] for available major upgrades needing review
- [CRITICAL] for failures requiring immediate intervention
- Falls back to syslog if SMTP unavailable
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>