Capture the full decision trail and chosen approach for making CLIA labs emailable: why NPI->NPPES (0.3%) and DirectTrust failed, datacenter-IP search blocking, the $99 B2B-list -> email-domain -> scrape-current-email -> verify pipeline (durable domain even when the mailbox is stale), hard rules protecting the warming mail pool, gzip/HTML-only bandwidth optimization, residential proxy options, the sample-validation gate before committing, what's already built (harvest, service, order page, email template), and the postcard fallback.
CLIA (and multi-vertical) Email Enrichment Plan
Status: planning / partially built
Owner: Performance West
Last updated: 2026-06-13
Goal
Turn the CMS CLIA laboratory file (676k labs; ~161k expiring in the next 12 months, ~13.4k/month) into a deliverable, emailable audience for the CLIA Certificate Renewal service, without harming the warming mail-sender pool, and build a reusable enrichment pipeline that works for trucking and telecom too.
Why this is needed
- CLIA POS file has no NPI and no email -- only facility name, mailing address, phone, and the certificate expiration date (TRMNTN_EXPRTN_DT, the recurring 2-year renewal trigger).
- Direct NPI->email join failed: matching CLIA -> emailable NPPES org by name+zip yielded only 186 / 69,791 (0.3%). Email-first via NPPES is dead.
- DirectTrust / Direct Secure Messaging is a closed clinical trust network (referrals/TOC/fax-replacement), not cold-mailable from a normal MTA -- verified, not viable for marketing.
- Datacenter IPs get bot-blocked by search engines almost immediately (15/15 rapid DDG queries blocked from the prod datacenter IP). Custom rDNS does NOT fix this (detection is ASN/rate-based, not PTR-based).
Channels by viability (CLIA audience)
| Channel | Viable? | Why |
|---|---|---|
| Cold email via NPPES NPI match | No | 0.3% match |
| DirectTrust / Direct Secure Messaging | No | closed clinical network, AUP-restricted |
| Self-scrape search from datacenter IPs | Fragile | search engines bot-block datacenter IPs |
| Self-scrape search via residential proxy | Yes | residential exit IPs avoid bot detection |
| B2B append list ($99, monthly-updated) | Test first | gives email + DOMAIN; cheap + reusable |
| Phone | Yes | clean phone for ~all 161k |
| Direct mail / postcard | Yes | clean name+address for ~all 161k (~3,100/wk for full coverage) |
Chosen approach: $99 B2B append list -> domain -> scrape current email -> verify
The append list's most durable asset is the email domain (practices keep their domain for years even as staff turn over). So:
- Buy the $99 B2B list (nationalemails.com, claims monthly updates; reusable across all verticals).
- Append/join the list to the CLIA file.
- Confidence filter: keep only rows where address OR phone matches the CMS CLIA record (confirms right entity, discards same-name mismatches).
- Extract the email domain from each matched record (durable even if the specific mailbox is stale).
- Fetch that known domain's website (home + /contact) and scrape the current contact email. Fetching KNOWN domains is cheap + reliable and may not even need the proxy (it is not search-engine scraping).
- Merge: prefer the freshly-scraped current email, fall back to the list email.
- Verify everything through the existing verifier (verify_csv_emails.py on the non-sending .72 IP: MX + SMTP RCPT + catch-all detection) BEFORE anything touches a warming IP.
- Output a send-ready CSV with mx_provider tags (for per-operator throttling).
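The confidence filter and domain extraction steps above can be sketched as follows. This is a minimal illustration, not the contents of any existing script: the row keys (`phone`, `street`, `zip`), the normalization rules, and the free-mail exclusion list are all assumptions made for the example.

```python
import re

# Assumption: free-mail domains carry no practice website worth scraping.
FREE_MAIL = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "aol.com"}

def norm_phone(raw):
    """Keep digits only and drop a leading US country code."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else ""

def norm_addr(street, zip_code):
    """Crude address key: lowercase alphanumerics of the street plus 5-digit ZIP."""
    street_key = re.sub(r"[^a-z0-9]", "", (street or "").lower())
    zip5 = re.sub(r"\D", "", zip_code or "")[:5]
    return f"{street_key}|{zip5}" if street_key and len(zip5) == 5 else ""

def entity_match(list_row, clia_row):
    """True when phone OR address agrees between the list row and the CLIA row."""
    p1, p2 = norm_phone(list_row.get("phone")), norm_phone(clia_row.get("phone"))
    a1 = norm_addr(list_row.get("street"), list_row.get("zip"))
    a2 = norm_addr(clia_row.get("street"), clia_row.get("zip"))
    return bool((p1 and p1 == p2) or (a1 and a1 == a2))

def email_domain(email):
    """Return the lowercase domain, or '' for malformed or free-mail addresses."""
    m = re.fullmatch(r"[^@\s]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})", (email or "").strip())
    if not m:
        return ""
    domain = m.group(1).lower()
    return "" if domain in FREE_MAIL else domain
```

Street normalization here does not reconcile abbreviations ("St" vs "Street"), so a real run would probably want a proper address standardizer before trusting the address half of the OR.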
Pipeline diagram
flowchart TD
A["$99 B2B list (name, addr, phone, email)"] --> B["Append/join to CLIA file"]
C["CMS CLIA file (name, addr, phone, expiry)"] --> B
B --> D{"address OR phone matches?"}
D -->|no| X["discard (wrong entity)"]
D -->|yes| E["extract email DOMAIN"]
E --> F["fetch known domain site\n(home + /contact)\ngzip, HTML-only, early-abort"]
F --> G["scrape current contact email"]
G --> H["merge: prefer scraped, fallback list email"]
H --> I["verify via .72\n(MX + SMTP RCPT + catch-all)"]
I --> J["send-ready CSV + mx_provider tags"]
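The scrape and merge nodes of the diagram might look roughly like this. It is a sketch under assumptions: the regex is a pragmatic pattern rather than a full address grammar, the asset-extension filter is a guess at a common false positive (retina image names such as `logo@2x.png`), and the `"scraped"` / `"list"` / `"none"` source labels are hypothetical.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ASSET_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def scrape_contact_email(html, domain):
    """Pick an address from page HTML, preferring ones at the known domain."""
    found = [e.lower() for e in EMAIL_RE.findall(html)]
    # Drop image filenames that merely look like addresses.
    found = [e for e in found if not e.endswith(ASSET_EXTS)]
    same_domain = [e for e in found if e.endswith("@" + domain.lower())]
    candidates = same_domain or found
    return candidates[0] if candidates else ""

def merge_email(scraped, list_email):
    """Prefer the freshly scraped address; fall back to the purchased-list address."""
    if scraped:
        return scraped, "scraped"
    if list_email:
        return list_email.strip().lower(), "list"
    return "", "none"
```

Keeping the source label next to the address makes it possible to compare verified-deliverable rates for scraped versus list emails later.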
Hard rules (non-negotiable)
- NEVER load a purchased/scraped list directly into the warming pool. Everything goes through the verifier first. We are mid reputation-recovery (Gmail/Outlook throttled us after the 4k/day spike) -- a bad list re-tanks it.
- Mail-sender pool (.94-.98) stays untouched by any scraping. Scraping egresses via residential proxy or the non-sending .72 IP only.
- Address/phone match = right ENTITY confidence; verifier = DELIVERABLE confidence. Need both before sending.
- CAN-SPAM: every commercial email carries the full postal address + unsubscribe (already enforced across templates).
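The "need both" rule can be expressed as a single gate applied before any row is written to the send-ready CSV. This is an illustrative sketch: the field names (`addr_match`, `phone_match`, `verify_status`, `catch_all`) and the `"deliverable"` status value are hypothetical, and excluding catch-all domains is an assumption about how the verifier's output would be used.

```python
def is_send_ready(row):
    """A row may reach a warming IP only if entity-matched AND verifier-passed."""
    entity_ok = row.get("addr_match") or row.get("phone_match")
    # Assumption: catch-all domains are held back because RCPT acceptance proves nothing.
    deliverable = row.get("verify_status") == "deliverable" and not row.get("catch_all")
    return bool(entity_ok and deliverable)

def send_ready_rows(rows):
    """Filter an iterable of row dicts down to those passing both checks."""
    return [r for r in rows if is_send_ready(r)]
```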
Bandwidth optimization (if proxy is used, billed per GB)
- Request Accept-Encoding: gzip, deflate, br (measured ~76% off a real clinic site: 68KB -> 16KB).
- HTML document only -- skip images/CSS/JS.
- Early-abort once a valid email is found (do not fetch /contact if home page had it).
- Cap max bytes per page.
- Net: ~161k run likely well under 5 GB -> ~$15-40 on a cheap residential proxy.
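A standard-library sketch of these optimizations follows. It deviates from the list above in one respect: it requests only `gzip`, because the Python standard library cannot decode Brotli. The 200,000-byte cap, the function names, and the injectable `fetch` parameter are assumptions for illustration, not part of any existing script.

```python
import re
import urllib.request
import zlib

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAX_BYTES = 200_000  # assumed per-page cap, not a measured value

def decode_body(raw, content_encoding):
    """Decompress a (possibly truncated) gzip body and decode it as text."""
    if content_encoding.lower() == "gzip":
        # decompressobj returns what it can from a stream cut short by the byte cap
        raw = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(raw)
    return raw.decode("utf-8", errors="replace")

def fetch_html(url, max_bytes=MAX_BYTES, timeout=10):
    """Fetch one page with gzip, skip non-HTML responses, read at most max_bytes."""
    req = urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip", "Accept": "text/html"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return ""
        raw = resp.read(max_bytes)
        return decode_body(raw, resp.headers.get("Content-Encoding", ""))

def find_email(domain, fetch=fetch_html, paths=("/", "/contact")):
    """Try the home page first and stop as soon as any page yields an address."""
    for path in paths:
        try:
            html = fetch(f"https://{domain}{path}")
        except (OSError, zlib.error):  # URLError and socket timeouts derive from OSError
            continue
        found = EMAIL_RE.findall(html)
        if found:
            return found[0].lower()  # early abort: later paths are never fetched
    return ""

def estimate_proxy_cost(sites, kb_per_page, pages_per_site, usd_per_gb):
    """Rough transfer volume in decimal GB and the resulting proxy cost."""
    gb = sites * pages_per_site * kb_per_page / 1_000_000
    return gb, gb * usd_per_gb
```

Because only the document URL is requested, images, CSS, and JS are never downloaded. With the measured 16 KB per compressed page, `estimate_proxy_cost(161_000, 16, 1, 5.0)` works out to about 2.6 GB for a one-page-per-site run, which is consistent with the "well under 5 GB" estimate when most sites abort on the home page.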
Residential proxy options (if needed for the search/scrape fallback)
Cheapest well-known, pay-as-you-go preferred:
- IPRoyal (~$3.50-7/GB, credits do not expire) -- top pick for one-off + reusable
- Webshare (~$3-5/GB) -- cheapest sticker if running regularly
- Decodo / ex-Smartproxy (~$3.5-7/GB) -- smoothest dashboard
- Avoid Bright Data / Oxylabs for this (premium price, reliability not needed on easy targets like direct site fetches).
Decision gate before committing
Get a sample export (50-100 rows) from nationalemails.com and:
- Confirm column format.
- Check actual coverage of small healthcare facilities (the CLIA long tail: solo-doc labs). If coverage is thin like the NPPES match was, the domain will not be there to extract and this approach yields little.
- Append the sample to CLIA, measure address/phone match rate.
- Run matched emails through the verifier; measure verified-deliverable rate.
- If verified-deliverable rate is decent at $99 reusable -> proceed full. If poor -> fall back to postcard/phone channel for CLIA.
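The sample measurements can be reduced to three numbers. The sketch below assumes each sample row has already been annotated with a boolean `entity_match` and a verifier `verify_status` (both hypothetical field names). The plan does not define what "decent" means, so the threshold is a required argument that has to be chosen explicitly rather than a default.

```python
def gate_metrics(rows):
    """Match rate, deliverable rate among matches, and end-to-end yield for a sample."""
    total = len(rows)
    matched = [r for r in rows if r["entity_match"]]
    deliverable = [r for r in matched if r["verify_status"] == "deliverable"]
    return {
        "match_rate": len(matched) / total if total else 0.0,
        "deliverable_rate": len(deliverable) / len(matched) if matched else 0.0,
        "end_to_end_yield": len(deliverable) / total if total else 0.0,
    }

def passes_gate(metrics, min_deliverable_rate):
    """Go/no-go on the full purchase; the threshold is the caller's decision."""
    return metrics["deliverable_rate"] >= min_deliverable_rate
```

With only 50-100 rows the rates will be noisy, so a borderline result is better treated as "get a larger sample" than as a firm go or no-go.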
Already built (this session)
- scripts/harvest_clia_renewals.py -- parse CMS CLIA file, filter to labs expiring within a window (default 120d), emit name/addr/phone/expiry. (676k scanned -> 69,791 expiring in [-30d, +120d]; on prod at data/npi_build/clia_renewals.csv.)
- scripts/match_clia_to_nppes.py -- NPI bridge attempt (0.3% yield; kept for reference, not the path forward).
- clia-renewal service in api/src/service-catalog.ts ($449, discountable) + order page site/src/pages/order/clia-renewal.astro + intake-manifest entry.
- data/hc_campaigns/hc_clia_renewal.html -- warm turnover-safety-net email with the striped official-record card (CLIA #, expiry, status), verify-on-CMS-QCOR, founder guarantee card, full CAN-SPAM address.
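For orientation, the [-30d, +120d] window filter can be pictured as below. This is not the code in the harvest script, only a sketch of the idea, and it assumes TRMNTN_EXPRTN_DT is stored as a YYYYMMDD string, which should be checked against the actual file layout.

```python
from datetime import date, datetime, timedelta

def in_renewal_window(expiry_raw, today, lookback_days=30, lookahead_days=120):
    """True when the expiry date falls within [today - lookback, today + lookahead]."""
    try:
        # Assumption: dates are YYYYMMDD strings; blank or malformed values are skipped.
        expiry = datetime.strptime((expiry_raw or "").strip(), "%Y%m%d").date()
    except ValueError:
        return False
    start = today - timedelta(days=lookback_days)
    end = today + timedelta(days=lookahead_days)
    return start <= expiry <= end
```

The lookback keeps recently lapsed certificates in scope, which matches the negative lower bound of the window quoted above.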
To build (once list sample is validated)
- scripts/append_match.py -- join purchased list to a vertical file (CLIA/trucking/telecom), keep address/phone-matched rows, flag confidence, extract domain.
- scripts/scrape_domain_emails.py -- fetch known domains (home + /contact), gzip + HTML-only + early-abort, scrape current contact email; optional --proxy for residential egress.
- Wire output into verify_csv_emails.py -> send-ready CSV with mx_provider.
- Add clia_renewal segment to scripts/build_healthcare_campaigns.py SEGMENTS + cron, MX-throttled, once a verified emailable list exists.
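One way the mx_provider tag could be derived is a suffix match on the MX hostname already resolved by the verifier. The tag names and the suffix table below are illustrative assumptions and may not match what verify_csv_emails.py actually emits. The function takes a hostname as input rather than doing a DNS lookup, since the standard library has no MX resolver.

```python
# Hypothetical suffix -> tag table; extend it from the MX hosts seen in real output.
MX_SUFFIXES = (
    ("google.com", "google"),
    ("googlemail.com", "google"),
    ("protection.outlook.com", "microsoft"),
    ("pphosted.com", "proofpoint"),
    ("mimecast.com", "mimecast"),
)

def mx_provider(mx_host):
    """Map an MX hostname to a coarse operator tag for per-operator throttling."""
    host = (mx_host or "").lower().rstrip(".")
    for suffix, tag in MX_SUFFIXES:
        if host == suffix or host.endswith("." + suffix):
            return tag
    return "other"
```

Matching on a leading dot before the suffix avoids tagging look-alike hosts such as `notgoogle.com` as Google.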
Postcard alternative (if email yield stays poor)
- ~161k labs/yr, avg ~13.4k/month (spikes: Aug ~26k, Mar ~20k = CMS batch months).
- Mail ~90 days before expiry: ~3,100 postcards/week for full coverage (~620/day, 5-day). Smooth the Aug/Mar spikes over a 60-120d pre-expiry window.
- All-in ~$0.40-0.75/card -> ~$5.4k-10k/month full coverage. Break-even ~12-22 conversions/month at the $449 service fee (~0.09-0.17% response).
- Drive responders to a "check my CLIA" tool on the site to capture email at the point of interest (converts the unreachable-by-email audience into warm leads).
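The postcard arithmetic above can be reproduced with a few lines, which makes it easy to re-run if the per-card cost or service fee changes. The function and key names are made up for the example; the default inputs are the figures quoted in this section.

```python
def postcard_plan(labs_per_year=161_000, cost_low=0.40, cost_high=0.75, fee=449.0):
    """Volume, monthly cost range, and break-even for full postcard coverage."""
    per_month = labs_per_year / 12
    per_week = labs_per_year / 52
    per_day = per_week / 5  # 5-day mailing week
    monthly_cost = (per_month * cost_low, per_month * cost_high)
    break_even = tuple(cost / fee for cost in monthly_cost)  # conversions per month
    response_rate = tuple(n / per_month for n in break_even)  # share of cards mailed
    return {
        "per_month": per_month,
        "per_week": per_week,
        "per_day": per_day,
        "monthly_cost": monthly_cost,
        "break_even_conversions": break_even,
        "break_even_response_rate": response_rate,
    }
```

The break-even response rate is simply cost per card divided by the fee, so it does not depend on volume. Smoothing the Aug/Mar spikes changes cash flow timing but not the rate needed to break even.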