new-site/docs/clia-enrichment-plan.md
justin 766e32e555 docs: CLIA / multi-vertical email enrichment plan
Capture the full decision trail and chosen approach for making CLIA labs
emailable: why NPI->NPPES (0.3%) and DirectTrust failed, datacenter-IP search
blocking, the $99 B2B-list -> email-domain -> scrape-current-email -> verify
pipeline (durable domain even when the mailbox is stale), hard rules protecting
the warming mail pool, gzip/HTML-only bandwidth optimization, residential proxy
options, the sample-validation gate before committing, what's already built
(harvest, service, order page, email template), and the postcard fallback.
2026-06-13 23:07:08 -05:00


# CLIA (and multi-vertical) Email Enrichment Plan
- **Status:** planning / partially built
- **Owner:** Performance West
- **Last updated:** 2026-06-13
## Goal
Turn the CMS CLIA laboratory file (676k labs; ~161k expiring in the next 12
months, ~13.4k/month) into a **deliverable, emailable** audience for the CLIA
Certificate Renewal service, without harming the warming mail-sender pool, and
build a **reusable enrichment pipeline** that works for trucking and telecom too.
## Why this is needed
- CLIA POS file has **no NPI and no email** -- only facility name, mailing
address, phone, and the certificate expiration date (`TRMNTN_EXPRTN_DT`, the
recurring 2-year renewal trigger).
- Direct NPI->email join failed: matching CLIA -> emailable NPPES org by
name+zip yielded only **186 / 69,791 (0.3%)**. Email-first via NPPES is dead.
- DirectTrust / Direct Secure Messaging is a **closed clinical trust network**
(referrals, transitions of care, fax replacement), not cold-mailable from a
normal MTA -- verified, not viable for marketing.
- Datacenter IPs get **bot-blocked by search engines** almost immediately
(15/15 rapid DDG queries blocked from the prod datacenter IP). Custom rDNS does
NOT fix this (detection is ASN/rate-based, not PTR-based).
## Channels by viability (CLIA audience)
| Channel | Viable? | Why |
|---|---|---|
| Cold email via NPPES NPI match | No | 0.3% match |
| DirectTrust / Direct Secure Messaging | No | closed clinical network, AUP-restricted |
| Self-scrape search from datacenter IPs | Fragile | search engines bot-block datacenter IPs |
| Self-scrape search via residential proxy | Yes | residential exit IPs avoid bot detection |
| **B2B append list ($99, monthly-updated)** | **Test first** | gives email + DOMAIN; cheap + reusable |
| Phone | Yes | clean phone for ~all 161k |
| Direct mail / postcard | Yes | clean name+address for ~all 161k (~3,100/wk for full coverage) |
## Chosen approach: $99 B2B append list -> domain -> scrape current email -> verify
The append list's most **durable** asset is the email **domain** (practices keep
their domain for years even as staff turn over). So:
1. Buy the **$99 B2B list** (nationalemails.com, claims monthly updates;
reusable across all verticals).
2. **Append/join** the list to the CLIA file.
3. **Confidence filter:** keep only rows where **address OR phone matches** the
CMS CLIA record (confirms right entity, discards same-name mismatches).
4. **Extract the email domain** from each matched record (durable even if the
specific mailbox is stale).
5. **Fetch that known domain's website** (home + /contact) and scrape the
**current** contact email. Fetching KNOWN domains is cheap + reliable and may
not even need the proxy (it is not search-engine scraping).
6. **Merge:** prefer the freshly-scraped current email, fall back to the list email.
7. **Verify** everything through the existing verifier (`verify_csv_emails.py` on
the non-sending **.72** IP: MX + SMTP RCPT + catch-all detection) BEFORE
anything touches a warming IP.
8. Output a **send-ready CSV** with `mx_provider` tags (for per-operator throttling).
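
Steps 3 and 4 can be sketched as below. This is a minimal illustration, not the planned `append_match.py`: the column names (`phone`, `street`, `zip`, `email`, `clia_id`) and the normalization rules are assumptions, since neither the list's export format nor the script exists yet.

```python
import re


def norm_phone(raw):
    """Digits only, minus a leading US country code; '' if not 10 digits."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else ""


def norm_addr(street, zip_code):
    """Crude address key: lowercased alphanumerics of the street plus ZIP5."""
    street_key = re.sub(r"[^a-z0-9]", "", (street or "").lower())
    zip5 = re.sub(r"\D", "", zip_code or "")[:5]
    return street_key + "|" + zip5 if street_key and len(zip5) == 5 else ""


def email_domain(email):
    """Lowercased domain part of an address, or '' if it looks malformed."""
    local, sep, domain = (email or "").strip().lower().rpartition("@")
    return domain if sep and local and "." in domain else ""


def match_row(list_row, clia_row):
    """Keep a list row only if its phone OR address matches the CLIA record."""
    list_phone = norm_phone(list_row.get("phone", ""))
    list_addr = norm_addr(list_row.get("street", ""), list_row.get("zip", ""))
    phone_hit = bool(list_phone) and list_phone == norm_phone(clia_row.get("phone", ""))
    addr_hit = bool(list_addr) and list_addr == norm_addr(
        clia_row.get("street", ""), clia_row.get("zip", ""))
    if not (phone_hit or addr_hit):
        return None  # same-name mismatch: discard
    return {
        "clia_id": clia_row.get("clia_id", ""),  # hypothetical key column
        "list_email": list_row.get("email", ""),
        "domain": email_domain(list_row.get("email", "")),
        "matched_on": "both" if phone_hit and addr_hit else
                      "phone" if phone_hit else "address",
    }
```

The address key here is deliberately naive ("St" and "Street" will not match), so a real join would likely need proper street normalization. Empty phone or address values never count as a match, which keeps blank-to-blank comparisons from passing the filter.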
### Pipeline diagram
```mermaid
flowchart TD
A["$99 B2B list (name, addr, phone, email)"] --> B["Append/join to CLIA file"]
C["CMS CLIA file (name, addr, phone, expiry)"] --> B
B --> D{"address OR phone matches?"}
D -->|no| X["discard (wrong entity)"]
D -->|yes| E["extract email DOMAIN"]
E --> F["fetch known domain site\n(home + /contact)\ngzip, HTML-only, early-abort"]
F --> G["scrape current contact email"]
G --> H["merge: prefer scraped, fallback list email"]
H --> I["verify via .72\n(MX + SMTP RCPT + catch-all)"]
I --> J["send-ready CSV + mx_provider tags"]
```
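
The merge node in the diagram reduces to a small preference rule. A sketch, assuming each record carries an optional scraped address and an optional list address (the `source` field is a hypothetical addition for later yield reporting):

```python
def merge_email(scraped, list_email):
    """Prefer the freshly scraped address; fall back to the list address."""
    for source, value in (("scraped", scraped), ("list", list_email)):
        cleaned = (value or "").strip().lower()
        if cleaned:
            return {"email": cleaned, "source": source}
    return None  # nothing to verify for this record
```

Keeping the source alongside the address would let the verifier results be split by origin, which shows whether the scrape step is actually earning its bandwidth.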
## Hard rules (non-negotiable)
- **NEVER load a purchased/scraped list directly into the warming pool.**
Everything goes through the verifier first. We are mid reputation-recovery
(Gmail/Outlook throttled us after the 4k/day spike) -- a bad list re-tanks it.
- **Mail-sender pool (.94-.98) stays untouched by any scraping.** Scraping
egresses via residential proxy or the non-sending .72 IP only.
- Address/phone match = right ENTITY confidence; verifier = DELIVERABLE
confidence. Need both before sending.
- CAN-SPAM: every commercial email carries the full postal address + unsubscribe
(already enforced across templates).
## Bandwidth optimization (if proxy is used, billed per GB)
- Request `Accept-Encoding: gzip, deflate, br` (measured ~76% off a real clinic
site: 68KB -> 16KB).
- HTML document only -- skip images/CSS/JS.
- Early-abort once a valid email is found (do not fetch /contact if home page had it).
- Cap max bytes per page.
- Net: ~161k run likely well under 5 GB -> ~$15-40 on a cheap residential proxy.
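
The rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the planned `scrape_domain_emails.py`: `MAX_BYTES`, `USER_AGENT`, and the email regex are hypothetical choices, and the sketch requests only `gzip` because the Python standard library cannot decode Brotli.

```python
import http.client
import re
import urllib.request
import zlib

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAX_BYTES = 64 * 1024  # hypothetical cap on bytes read per page
USER_AGENT = "enrichment-sketch/0.1"  # hypothetical


def decode_body(raw, content_encoding):
    """Gunzip a body that may have been cut short by the byte cap, then decode."""
    if content_encoding.strip().lower() == "gzip":
        # wbits=16+MAX_WBITS selects the gzip container; a decompress object
        # returns whatever it could inflate instead of failing on a short stream.
        raw = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(raw)
    return raw.decode("utf-8", errors="replace")


def fetch_html(url, timeout=10.0):
    """Fetch one HTML document, compressed, reading at most MAX_BYTES."""
    req = urllib.request.Request(url, headers={
        "Accept-Encoding": "gzip",
        "Accept": "text/html",
        "User-Agent": USER_AGENT,
    })
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if "html" not in resp.headers.get("Content-Type", "").lower():
            return ""  # not an HTML document: nothing to scrape
        raw = resp.read(MAX_BYTES)
        return decode_body(raw, resp.headers.get("Content-Encoding", ""))


def find_contact_email(domain, fetch=fetch_html):
    """Try the home page, then /contact; stop at the first page with an email."""
    for path in ("/", "/contact"):
        try:
            html = fetch("https://" + domain + path)
        except (OSError, ValueError, http.client.HTTPException, zlib.error):
            continue  # unreachable or malformed page: try the next path
        match = EMAIL_RE.search(html)
        if match:
            return match.group(0).lower()  # early abort: skip remaining paths
    return None
```

Because only the document itself is requested, images, CSS, and JS are never downloaded. The regex is permissive and will also match strings such as asset names containing `@`, so the verifier remains the real filter. Passing `fetch` as a parameter keeps the early-abort logic testable without network access.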
## Residential proxy options (if needed for the search/scrape fallback)
Cheapest well-known, pay-as-you-go preferred:
- **IPRoyal** (~$3.50-7/GB, credits do not expire) -- top pick for one-off + reusable
- **Webshare** (~$3-5/GB) -- cheapest sticker if running regularly
- **Decodo / ex-Smartproxy** (~$3.5-7/GB) -- smoothest dashboard
- Avoid Bright Data / Oxylabs for this (premium price, reliability not needed on
easy targets like direct site fetches).
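
A rough way to sanity-check the proxy bill against these per-GB prices. The function below is a sketch: the 16 KB page size is the gzipped measurement quoted earlier, while two pages per domain is an assumed worst case in which every lab has a domain and the home page never contains an email.

```python
def proxy_cost_usd(domains, kb_per_page, pages_per_domain, usd_per_gb):
    """Return (gigabytes transferred, cost in USD) using decimal units."""
    gb = domains * pages_per_domain * kb_per_page / 1_000_000  # KB -> GB
    return gb, gb * usd_per_gb


# Worst case for the full audience at the low and high ends of the price range.
low_gb, low_cost = proxy_cost_usd(161_000, 16, 2, 3.0)
high_gb, high_cost = proxy_cost_usd(161_000, 16, 2, 7.0)
```

This is an upper bound: rows that fail the address/phone match are never fetched, and the early-abort rule skips `/contact` whenever the home page already yields an address, so the real transfer should come in lower.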
## Decision gate before committing
Get a **sample export (50-100 rows)** from nationalemails.com and:
1. Confirm column format.
2. Check actual **coverage of small healthcare facilities** (the CLIA long tail:
solo-doc labs). If coverage is thin like the NPPES match was, the domain will
not be there to extract and this approach yields little.
3. Append the sample to CLIA, measure address/phone match rate.
4. Run matched emails through the verifier; measure verified-deliverable rate.
5. If verified-deliverable rate is decent at $99 reusable -> proceed full.
If poor -> fall back to postcard/phone channel for CLIA.
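
The gate comes down to three ratios from the sample. A minimal sketch follows; the plan does not define what counts as a "decent" rate, so `min_end_to_end` is a hypothetical threshold the team would have to choose, and the function names are illustrative.

```python
def gate_metrics(sample_rows, matched_rows, verified_rows):
    """Rates for the sample: entity match, deliverability, and end to end."""
    return {
        "match_rate": matched_rows / sample_rows if sample_rows else 0.0,
        "deliverable_rate": verified_rows / matched_rows if matched_rows else 0.0,
        "end_to_end": verified_rows / sample_rows if sample_rows else 0.0,
    }


def decide(metrics, min_end_to_end):
    """Proceed with the full list only if enough sample rows end up sendable."""
    if metrics["end_to_end"] >= min_end_to_end:
        return "proceed_full_list"
    return "fallback_postcard_phone"
```

For example, `gate_metrics(100, 40, 25)` describes a 100-row sample where 40 rows matched on address or phone and 25 of those verified as deliverable. With only 50-100 rows the rates will be noisy, so they are better read as a rough signal than a precise forecast.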
## Already built (this session)
- `scripts/harvest_clia_renewals.py` -- parse CMS CLIA file, filter to labs
expiring within a window (default 120d), emit name/addr/phone/expiry.
(676k scanned -> 69,791 expiring in [-30d, +120d]; on prod at
`data/npi_build/clia_renewals.csv`.)
- `scripts/match_clia_to_nppes.py` -- NPI bridge attempt (0.3% yield; kept for
reference, not the path forward).
- `clia-renewal` service in `api/src/service-catalog.ts` ($449, discountable) +
order page `site/src/pages/order/clia-renewal.astro` + intake-manifest entry.
- `data/hc_campaigns/hc_clia_renewal.html` -- warm turnover-safety-net email with
the striped official-record card (CLIA #, expiry, status), verify-on-CMS-QCOR,
founder guarantee card, full CAN-SPAM address.
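
For orientation, the window filter described for the harvest step could look roughly like the sketch below. It is not the code in `harvest_clia_renewals.py`; in particular, the `YYYYMMDD` format for `TRMNTN_EXPRTN_DT` is an assumption that should be checked against the actual CMS file.

```python
from datetime import date, datetime, timedelta


def in_renewal_window(expiry_raw, today, back_days=30, ahead_days=120,
                      fmt="%Y%m%d"):
    """True if the expiry date falls in [today - back_days, today + ahead_days]."""
    try:
        expiry = datetime.strptime(expiry_raw.strip(), fmt).date()
    except ValueError:
        return False  # blank or unparseable date: skip the lab
    start = today - timedelta(days=back_days)
    end = today + timedelta(days=ahead_days)
    return start <= expiry <= end


# Example: evaluate one record relative to a fixed reference date.
keep = in_renewal_window("20260901", date(2026, 6, 13))
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps reruns reproducible against a given snapshot of the file.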
## To build (once list sample is validated)
1. `scripts/append_match.py` -- join purchased list to a vertical file
(CLIA/trucking/telecom), keep address/phone-matched rows, flag confidence,
extract domain.
2. `scripts/scrape_domain_emails.py` -- fetch known domains (home + /contact),
gzip + HTML-only + early-abort, scrape current contact email; optional
`--proxy` for residential egress.
3. Wire output into `verify_csv_emails.py` -> send-ready CSV with `mx_provider`.
4. Add `clia_renewal` segment to `scripts/build_healthcare_campaigns.py` SEGMENTS
+ cron, MX-throttled, once a verified emailable list exists.
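
The `mx_provider` tag in item 3 could be derived from the MX hostname the verifier already resolves. A sketch, assuming the hostname is available as a string; the tag values and the suffix table are illustrative, not the verifier's actual output.

```python
from collections import Counter

# MX hostname suffix -> provider tag (tag names are hypothetical).
MX_SUFFIXES = (
    ("google.com", "google"),
    ("googlemail.com", "google"),
    ("protection.outlook.com", "microsoft"),
    ("yahoodns.net", "yahoo"),
)


def mx_provider(mx_host):
    """Map an MX hostname to a coarse mailbox-operator tag."""
    host = mx_host.strip().rstrip(".").lower()
    for suffix, provider in MX_SUFFIXES:
        if host == suffix or host.endswith("." + suffix):
            return provider
    return "other"


def provider_counts(mx_hosts):
    """Count rows per provider, e.g. to size per-operator send throttles."""
    return Counter(mx_provider(h) for h in mx_hosts)
```

Matching on a dot-delimited suffix rather than a substring avoids tagging look-alike hosts (for example `notgoogle.com`) as a major provider.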
## Postcard alternative (if email yield stays poor)
- ~161k labs/yr, avg ~13.4k/month (spikes: Aug ~26k, Mar ~20k = CMS batch months).
- Mail ~90 days before expiry: **~3,100 postcards/week** for full coverage
(~620/day, 5-day). Smooth the Aug/Mar spikes over a 60-120d pre-expiry window.
- All-in ~$0.40-0.75/card -> ~$5.4k-10k/month full coverage. Break-even ~12-22
conversions/month at the $449 service fee (~0.09-0.17% response on ~13.4k cards/month).
- Drive responders to a "check my CLIA" tool on the site to capture email at the
point of interest (converts the unreachable-by-email audience into warm leads).
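
The postcard figures above follow from a few divisions, sketched here so they can be rerun with different inputs. The function and key names are illustrative, and break-even is taken as mailing cost against the service fee only, ignoring fulfillment cost and discounts.

```python
def postcard_plan(labs_per_year, cost_per_card, service_fee):
    """Volume, monthly cost, and break-even for full postcard coverage."""
    per_month = labs_per_year / 12
    per_week = labs_per_year / 52
    monthly_cost = per_month * cost_per_card
    breakeven_orders = monthly_cost / service_fee
    return {
        "per_month": per_month,
        "per_week": per_week,
        "per_day_5day": per_week / 5,
        "monthly_cost": monthly_cost,
        "breakeven_orders": breakeven_orders,
        "breakeven_response": breakeven_orders / per_month,
    }


low = postcard_plan(161_000, 0.40, 449)   # cheapest all-in card cost
high = postcard_plan(161_000, 0.75, 449)  # most expensive all-in card cost
```

Note that the break-even response rate reduces to `cost_per_card / service_fee`, so it does not depend on volume: mailing fewer cards changes the spend but not the response rate needed to cover it.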