diff --git a/docs/clia-enrichment-plan.md b/docs/clia-enrichment-plan.md
new file mode 100644
index 0000000..50612cc
--- /dev/null
+++ b/docs/clia-enrichment-plan.md
@@ -0,0 +1,153 @@
+# CLIA (and multi-vertical) Email Enrichment Plan
+
+**Status:** planning / partially built
+**Owner:** Performance West
+**Last updated:** 2026-06-13
+
+## Goal
+
+Turn the CMS CLIA laboratory file (676k labs; ~161k expiring in the next 12
+months, ~13.4k/month) into a **deliverable, emailable** audience for the CLIA
+Certificate Renewal service, without harming the warming mail-sender pool, and
+build a **reusable enrichment pipeline** that works for trucking and telecom too.
+
+## Why this is needed
+
+- CLIA POS file has **no NPI and no email** -- only facility name, mailing
+  address, phone, and the certificate expiration date (`TRMNTN_EXPRTN_DT`, the
+  recurring 2-year renewal trigger).
+- Direct NPI->email join failed: matching CLIA -> emailable NPPES org by
+  name+zip yielded only **186 / 69,791 (0.3%)**. Email-first via NPPES is dead.
+- DirectTrust / Direct Secure Messaging is a **closed clinical trust network**
+  (referrals/TOC/fax-replacement), not cold-mailable from a normal MTA -- verified,
+  not viable for marketing.
+- Datacenter IPs get **bot-blocked by search engines** almost immediately
+  (15/15 rapid DDG queries blocked from the prod datacenter IP). Custom rDNS does
+  NOT fix this (detection is ASN/rate-based, not PTR-based).
+
+## Channels by viability (CLIA audience)
+
+| Channel | Viable? | Why |
+|---|---|---|
+| Cold email via NPPES NPI match | No | 0.3% match |
+| DirectTrust / Direct Secure Messaging | No | closed clinical network, AUP-restricted |
+| Self-scrape search from datacenter IPs | Fragile | search engines bot-block datacenter IPs |
+| Self-scrape search via residential proxy | Yes | residential exit IPs avoid bot detection |
+| **B2B append list ($99, monthly-updated)** | **Test first** | gives email + DOMAIN; cheap + reusable |
+| Phone | Yes | clean phone for ~all 161k |
+| Direct mail / postcard | Yes | clean name+address for ~all 161k (~3,100/wk for full coverage) |
+
+## Chosen approach: $99 B2B append list -> domain -> scrape current email -> verify
+
+The append list's most **durable** asset is the email **domain** (practices keep
+their domain for years even as staff turn over). So:
+
+1. Buy the **$99 B2B list** (nationalemails.com, claims monthly updates;
+   reusable across all verticals).
+2. **Append/join** the list to the CLIA file.
+3. **Confidence filter:** keep only rows where **address OR phone matches** the
+   CMS CLIA record (confirms right entity, discards same-name mismatches).
+4. **Extract the email domain** from each matched record (durable even if the
+   specific mailbox is stale).
+5. **Fetch that known domain's website** (home + /contact) and scrape the
+   **current** contact email. Fetching KNOWN domains is cheap + reliable and may
+   not even need the proxy (it is not search-engine scraping).
+6. **Merge:** prefer the freshly-scraped current email, fall back to the list email.
+7. **Verify** everything through the existing verifier (`verify_csv_emails.py` on
+   the non-sending **.72** IP: MX + SMTP RCPT + catch-all detection) BEFORE
+   anything touches a warming IP.
+8. Output a **send-ready CSV** with `mx_provider` tags (for per-operator throttling).
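Steps 3, 4 and 6 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row keys (`phone`, `address`) and every function name here are hypothetical, and nothing below is taken from the planned `scripts/append_match.py`. The address comparison is a plain normalized-string equality, which a real run would probably want to loosen (suite numbers, abbreviations).

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> str:
    """Keep digits only and drop a leading US country code."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def normalize_address(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", (raw or "").lower())
    return " ".join(cleaned.split())


def entity_matches(list_row: dict, clia_row: dict) -> bool:
    """Step 3: keep a row only if address OR phone agrees with the CLIA record.

    The "phone" and "address" keys are assumed column names.
    """
    phone_a = normalize_phone(list_row.get("phone", ""))
    phone_b = normalize_phone(clia_row.get("phone", ""))
    if len(phone_a) == 10 and phone_a == phone_b:
        return True
    addr_a = normalize_address(list_row.get("address", ""))
    addr_b = normalize_address(clia_row.get("address", ""))
    return bool(addr_a) and addr_a == addr_b


def extract_domain(email: str) -> Optional[str]:
    """Step 4: return the lowercased domain part, or None if malformed."""
    local, sep, domain = (email or "").strip().lower().rpartition("@")
    if not sep or not local or "." not in domain:
        return None
    return domain


def merge_email(scraped: Optional[str], list_email: Optional[str]) -> Optional[str]:
    """Step 6: prefer the freshly scraped address, fall back to the list address."""
    return scraped or list_email or None
```

Empty phones and empty addresses never count as a match, so two rows that are both missing data are discarded rather than kept. A real run would likely also skip free-mail domains such as `gmail.com` after step 4, since they say nothing about the practice's own website.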
+
+### Pipeline diagram
+
+```mermaid
+flowchart TD
+    A["$99 B2B list (name, addr, phone, email)"] --> B["Append/join to CLIA file"]
+    C["CMS CLIA file (name, addr, phone, expiry)"] --> B
+    B --> D{"address OR phone matches?"}
+    D -->|no| X["discard (wrong entity)"]
+    D -->|yes| E["extract email DOMAIN"]
+    E --> F["fetch known domain site\n(home + /contact)\ngzip, HTML-only, early-abort"]
+    F --> G["scrape current contact email"]
+    G --> H["merge: prefer scraped, fallback list email"]
+    H --> I["verify via .72\n(MX + SMTP RCPT + catch-all)"]
+    I --> J["send-ready CSV + mx_provider tags"]
+```
+
+## Hard rules (non-negotiable)
+
+- **NEVER load a purchased/scraped list directly into the warming pool.**
+  Everything goes through the verifier first. We are mid reputation-recovery
+  (Gmail/Outlook throttled us after the 4k/day spike) -- a bad list re-tanks it.
+- **Mail-sender pool (.94-.98) stays untouched by any scraping.** Scraping
+  egresses via residential proxy or the non-sending .72 IP only.
+- Address/phone match = right ENTITY confidence; verifier = DELIVERABLE
+  confidence. Need both before sending.
+- CAN-SPAM: every commercial email carries the full postal address + unsubscribe
+  (already enforced across templates).
+
+## Bandwidth optimization (if proxy is used, billed per GB)
+
+- Request `Accept-Encoding: gzip, deflate, br` (measured ~76% off a real clinic
+  site: 68KB -> 16KB).
+- HTML document only -- skip images/CSS/JS.
+- Early-abort once a valid email is found (do not fetch /contact if home page had it).
+- Cap max bytes per page.
+- Net: ~161k run likely well under 5 GB -> ~$15-40 on a cheap residential proxy.
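The four bandwidth rules above (compressed transfer, HTML only, early-abort, byte cap) might look roughly like this with the Python standard library. This is a sketch and not the planned `scripts/scrape_domain_emails.py`: `MAX_BYTES`, the `User-Agent` string, the email pattern and both function names are assumptions, and it requests `gzip` only because the standard library has no Brotli decoder.

```python
import re
import urllib.request
import zlib
from typing import Callable, Optional

# Crude pattern (assumption); the verifier stays the authority on deliverability.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAX_BYTES = 200_000  # hypothetical per-page cap on compressed bytes read


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch one page with gzip, skip non-HTML responses, cap bytes read."""
    req = urllib.request.Request(
        url,
        headers={"Accept-Encoding": "gzip", "User-Agent": "enrichment-sketch/0.1"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if "html" not in resp.headers.get("Content-Type", ""):
            return ""  # HTML documents only
        body = resp.read(MAX_BYTES)
        if resp.headers.get("Content-Encoding") == "gzip":
            try:
                # decompressobj tolerates a stream truncated by the byte cap
                body = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(body)
            except zlib.error:
                return ""
    return body.decode("utf-8", errors="replace")


def find_contact_email(
    domain: str,
    fetch: Callable[[str], str] = fetch_html,
    paths: tuple = ("/", "/contact"),
) -> Optional[str]:
    """Try the home page first; only fetch /contact if no email was found."""
    for path in paths:
        try:
            html = fetch(f"https://{domain}{path}")
        except OSError:  # covers URLError, HTTPError and socket timeouts
            continue
        match = EMAIL_RE.search(html)
        if match:
            return match.group(0).lower()  # early-abort: skip remaining paths
    return None
```

The `fetch` parameter is injectable so the early-abort logic can be exercised without network access, and a proxy-aware fetcher could be swapped in the same way. The regex will also match asset names like `logo@2x.png`, so a real scraper would need a filter before handing results to the verifier.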
+
+## Residential proxy options (if needed for the search/scrape fallback)
+
+Cheapest well-known, pay-as-you-go preferred:
+- **IPRoyal** (~$3.50-7/GB, credits do not expire) -- top pick for one-off + reusable
+- **Webshare** (~$3-5/GB) -- cheapest sticker if running regularly
+- **Decodo / ex-Smartproxy** (~$3.5-7/GB) -- smoothest dashboard
+- Avoid Bright Data / Oxylabs for this (premium price, reliability not needed on
+  easy targets like direct site fetches).
+
+## Decision gate before committing
+
+Get a **sample export (50-100 rows)** from nationalemails.com and:
+1. Confirm column format.
+2. Check actual **coverage of small healthcare facilities** (the CLIA long tail --
+   solo-doc labs). If coverage is thin like the NPPES match was, the domain will
+   not be there to extract and this approach yields little.
+3. Append the sample to CLIA, measure address/phone match rate.
+4. Run matched emails through the verifier; measure verified-deliverable rate.
+5. If verified-deliverable rate is decent at $99 reusable -> proceed full.
+   If poor -> fall back to postcard/phone channel for CLIA.
+
+## Already built (this session)
+
+- `scripts/harvest_clia_renewals.py` -- parse CMS CLIA file, filter to labs
+  expiring within a window (default 120d), emit name/addr/phone/expiry.
+  (676k scanned -> 69,791 expiring in [-30d, +120d]; on prod at
+  `data/npi_build/clia_renewals.csv`.)
+- `scripts/match_clia_to_nppes.py` -- NPI bridge attempt (0.3% yield; kept for
+  reference, not the path forward).
+- `clia-renewal` service in `api/src/service-catalog.ts` ($449, discountable) +
+  order page `site/src/pages/order/clia-renewal.astro` + intake-manifest entry.
+- `data/hc_campaigns/hc_clia_renewal.html` -- warm turnover-safety-net email with
+  the striped official-record card (CLIA #, expiry, status), verify-on-CMS-QCOR,
+  founder guarantee card, full CAN-SPAM address.
+
+## To build (once list sample is validated)
+
+1. `scripts/append_match.py` -- join purchased list to a vertical file
+   (CLIA/trucking/telecom), keep address/phone-matched rows, flag confidence,
+   extract domain.
+2. `scripts/scrape_domain_emails.py` -- fetch known domains (home + /contact),
+   gzip + HTML-only + early-abort, scrape current contact email; optional
+   `--proxy` for residential egress.
+3. Wire output into `verify_csv_emails.py` -> send-ready CSV with `mx_provider`.
+4. Add `clia_renewal` segment to `scripts/build_healthcare_campaigns.py` SEGMENTS +
+   cron, MX-throttled, once a verified emailable list exists.
+
+## Postcard alternative (if email yield stays poor)
+
+- ~161k labs/yr, avg ~13.4k/month (spikes: Aug ~26k, Mar ~20k = CMS batch months).
+- Mail ~90 days before expiry: **~3,100 postcards/week** for full coverage
+  (~620/day, 5-day). Smooth the Aug/Mar spikes over a 60-120d pre-expiry window.
+- All-in ~$0.40-0.75/card -> ~$5.4k-10k/month full coverage. Break-even ~12-22
+  conversions/month at the $449 service fee (~0.09-0.17% response).
+- Drive responders to a "check my CLIA" tool on the site to capture email at the
+  point of interest (converts the unreachable-by-email audience into warm leads).
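The postcard break-even figures can be rechecked with a short calculation. The inputs are the plan's own numbers (161k labs per year, $0.40-0.75 per card, $449 fee); the function name is hypothetical, and the model assumes full-price orders and ignores every cost other than the cards themselves.

```python
LABS_PER_YEAR = 161_000
SERVICE_FEE = 449.0            # full price; any discount raises the break-even
COST_PER_CARD = (0.40, 0.75)   # all-in low and high estimates from the plan


def break_even(cost_per_card: float, cards_per_month: float, fee: float):
    """Return (monthly spend, orders needed to cover it, required response rate)."""
    spend = cards_per_month * cost_per_card
    orders = spend / fee
    rate = orders / cards_per_month  # simplifies to cost_per_card / fee
    return spend, orders, rate


if __name__ == "__main__":
    cards_per_month = LABS_PER_YEAR / 12
    print(f"cards/week: {LABS_PER_YEAR / 52:,.0f}")
    for cost in COST_PER_CARD:
        spend, orders, rate = break_even(cost, cards_per_month, SERVICE_FEE)
        print(f"${cost:.2f}/card: ${spend:,.0f}/month, "
              f"{orders:.1f} orders to break even, {rate:.3%} response")
```

The required response rate reduces to card cost divided by fee, so it does not depend on volume: about 0.09% at $0.40 per card and about 0.17% at $0.75.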