new-site/docs/clia-enrichment-plan.md
justin 766e32e555 docs: CLIA / multi-vertical email enrichment plan
Capture the full decision trail and chosen approach for making CLIA labs
emailable: why NPI->NPPES (0.3%) and DirectTrust failed, datacenter-IP search
blocking, the $99 B2B-list -> email-domain -> scrape-current-email -> verify
pipeline (durable domain even when the mailbox is stale), hard rules protecting
the warming mail pool, gzip/HTML-only bandwidth optimization, residential proxy
options, the sample-validation gate before committing, what's already built
(harvest, service, order page, email template), and the postcard fallback.
2026-06-13 23:07:08 -05:00


CLIA (and multi-vertical) Email Enrichment Plan

Status: planning / partially built
Owner: Performance West
Last updated: 2026-06-13

Goal

Turn the CMS CLIA laboratory file (676k labs; ~161k expiring in the next 12 months, ~13.4k/month) into a deliverable, emailable audience for the CLIA Certificate Renewal service, without harming the warming mail-sender pool, and build a reusable enrichment pipeline that works for trucking and telecom too.

Why this is needed

  • CLIA POS file has no NPI and no email -- only facility name, mailing address, phone, and the certificate expiration date (TRMNTN_EXPRTN_DT, the recurring 2-year renewal trigger).
  • Direct NPI->email join failed: matching CLIA -> emailable NPPES org by name+zip yielded only 186 / 69,791 (0.3%). Email-first via NPPES is dead.
  • DirectTrust / Direct Secure Messaging is a closed clinical trust network (referrals, transitions of care, fax replacement), not cold-mailable from a normal MTA -- verified, not viable for marketing.
  • Datacenter IPs get bot-blocked by search engines almost immediately (15/15 rapid DuckDuckGo queries were blocked from the prod datacenter IP). Custom rDNS does NOT fix this (detection is ASN/rate-based, not PTR-based).

Channels by viability (CLIA audience)

| Channel | Viable? | Why |
|---|---|---|
| Cold email via NPPES NPI match | No | 0.3% match |
| DirectTrust / Direct Secure Messaging | No | closed clinical network, AUP-restricted |
| Self-scrape search from datacenter IPs | Fragile | search engines bot-block datacenter IPs |
| Self-scrape search via residential proxy | Yes | residential exit IPs avoid bot detection |
| B2B append list ($99, monthly-updated) | Test first | gives email + DOMAIN; cheap + reusable |
| Phone | Yes | clean phone for ~all 161k |
| Direct mail / postcard | Yes | clean name+address for ~all 161k (~3,100/wk for full coverage) |

Chosen approach: $99 B2B append list -> domain -> scrape current email -> verify

The append list's most durable asset is the email domain (practices keep their domain for years even as staff turn over). So:

  1. Buy the $99 B2B list (nationalemails.com, claims monthly updates; reusable across all verticals).
  2. Append/join the list to the CLIA file.
  3. Confidence filter: keep only rows where address OR phone matches the CMS CLIA record (confirms right entity, discards same-name mismatches).
  4. Extract the email domain from each matched record (durable even if the specific mailbox is stale).
  5. Fetch that known domain's website (home + /contact) and scrape the current contact email. Fetching KNOWN domains is cheap + reliable and may not even need the proxy (it is not search-engine scraping).
  6. Merge: prefer the freshly-scraped current email, fall back to the list email.
  7. Verify everything through the existing verifier (verify_csv_emails.py on the non-sending .72 IP: MX + SMTP RCPT + catch-all detection) BEFORE anything touches a warming IP.
  8. Output a send-ready CSV with mx_provider tags (for per-operator throttling).
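
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the planned scripts/append_match.py: the column names (`phone`, `street`, `zip`, `email`), the normalization rules, and the `FREEMAIL` set are all assumptions made for the example.

```python
import re

# Assumption: consumer mailbox providers are not facility-owned domains,
# so they are skipped as scrape targets. This set is illustrative only.
FREEMAIL = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "aol.com"}


def norm_phone(raw: str) -> str:
    """Keep digits only and drop a leading US country code."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else ""


def norm_addr(street: str, zip_code: str) -> str:
    """Crude address key: uppercase alphanumeric street plus 5-digit ZIP."""
    street_key = re.sub(r"[^A-Z0-9]", "", (street or "").upper())
    zip5 = re.sub(r"\D", "", zip_code or "")[:5]
    return f"{street_key}|{zip5}" if street_key and len(zip5) == 5 else ""


def same_entity(list_row: dict, clia_row: dict) -> bool:
    """Step 3: keep the row only if address OR phone agrees with the CMS record."""
    p1 = norm_phone(list_row.get("phone", ""))
    p2 = norm_phone(clia_row.get("phone", ""))
    a1 = norm_addr(list_row.get("street", ""), list_row.get("zip", ""))
    a2 = norm_addr(clia_row.get("street", ""), clia_row.get("zip", ""))
    # Empty keys never count as a match.
    return bool((p1 and p1 == p2) or (a1 and a1 == a2))


def email_domain(email: str) -> str:
    """Step 4: return the lowercased domain, or "" if it is unusable."""
    _, sep, domain = (email or "").strip().lower().rpartition("@")
    if not sep or "." not in domain or domain in FREEMAIL:
        return ""
    return domain
```

The address key here is deliberately crude: "12 Main St." and "12 Main Street" will not match, so the phone comparison carries most of the weight unless a proper address normalizer is added.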

Pipeline diagram

flowchart TD
  A["$99 B2B list (name, addr, phone, email)"] --> B["Append/join to CLIA file"]
  C["CMS CLIA file (name, addr, phone, expiry)"] --> B
  B --> D{"address OR phone matches?"}
  D -->|no| X["discard (wrong entity)"]
  D -->|yes| E["extract email DOMAIN"]
  E --> F["fetch known domain site\n(home + /contact)\ngzip, HTML-only, early-abort"]
  F --> G["scrape current contact email"]
  G --> H["merge: prefer scraped, fallback list email"]
  H --> I["verify via .72\n(MX + SMTP RCPT + catch-all)"]
  I --> J["send-ready CSV + mx_provider tags"]
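
The merge node (H) in the diagram is small enough to sketch directly. The function name and the `source` labels below are hypothetical; recording the source is a suggestion so that verified-deliverable rates can later be compared between scraped and list emails.

```python
from typing import Optional, Tuple


def merge_email(scraped: Optional[str], list_email: Optional[str]) -> Tuple[str, str]:
    """Prefer the freshly scraped email; fall back to the list email.

    Returns (email, source) where source is "scraped", "list", or "none".
    """
    for candidate, source in ((scraped, "scraped"), (list_email, "list")):
        cleaned = (candidate or "").strip().lower()
        if cleaned:
            return cleaned, source
    return "", "none"
```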

Hard rules (non-negotiable)

  • NEVER load a purchased/scraped list directly into the warming pool. Everything goes through the verifier first. We are mid reputation-recovery (Gmail/Outlook throttled us after the 4k/day spike) -- a bad list re-tanks it.
  • Mail-sender pool (.94-.98) stays untouched by any scraping. Scraping egresses via residential proxy or the non-sending .72 IP only.
  • Address/phone match = right ENTITY confidence; verifier = DELIVERABLE confidence. Need both before sending.
  • CAN-SPAM: every commercial email carries the full postal address + unsubscribe (already enforced across templates).
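
The "need both" rule can be enforced as a single gate in front of the send queue. A minimal sketch, assuming hypothetical column names `entity_match` and `verify_status` (the real output columns of verify_csv_emails.py are not specified here):

```python
import csv
import io


def is_sendable(row: dict) -> bool:
    """A row may reach the warming pool only if BOTH checks passed."""
    right_entity = (row.get("entity_match") or "").lower() == "yes"
    deliverable = (row.get("verify_status") or "").lower() == "deliverable"
    return right_entity and deliverable


def sendable_rows(csv_text: str) -> list:
    """Filter verifier output down to rows that satisfy both confidences."""
    return [row for row in csv.DictReader(io.StringIO(csv_text)) if is_sendable(row)]
```

In this sketch anything other than an explicit "deliverable" status is held back, which includes catch-all results; whether catch-all domains should ever be sent to is left open.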

Bandwidth optimization (if proxy is used, billed per GB)

  • Send Accept-Encoding: gzip, deflate, br (measured ~76% reduction on a real clinic site: 68 KB -> 16 KB).
  • HTML document only -- skip images/CSS/JS.
  • Early-abort once a valid email is found (do not fetch /contact if home page had it).
  • Cap max bytes per page.
  • Net: ~161k run likely well under 5 GB -> ~$15-40 on a cheap residential proxy.
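
The four optimizations above can be sketched with the standard library alone. This is an illustration under assumptions, not the planned scripts/scrape_domain_emails.py: `MAX_BYTES` is a hypothetical cap, `br` is left out of Accept-Encoding because the standard library cannot decode Brotli, and proxy support is omitted.

```python
import re
import urllib.request
import zlib

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAX_BYTES = 200_000  # hypothetical cap on compressed bytes read per page


def decode_body(raw: bytes, content_encoding: str) -> str:
    """Decompress a (possibly truncated) response body and decode it to text."""
    enc = (content_encoding or "").lower()
    if enc == "gzip":
        # wbits=31 selects gzip framing; decompressobj tolerates truncated input.
        raw = zlib.decompressobj(wbits=31).decompress(raw)
    elif enc == "deflate":
        raw = zlib.decompressobj().decompress(raw)
    return raw.decode("utf-8", errors="replace")


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch one HTML document, compressed, reading at most MAX_BYTES."""
    req = urllib.request.Request(url, headers={
        "Accept-Encoding": "gzip, deflate",  # br omitted: stdlib cannot decode it
        "Accept": "text/html",
    })
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if "html" not in resp.headers.get("Content-Type", ""):
            return ""
        raw = resp.read(MAX_BYTES)  # byte cap: stop reading past the limit
        return decode_body(raw, resp.headers.get("Content-Encoding", ""))


def find_contact_email(domain: str, fetch=fetch_html) -> str:
    """Try the home page, then /contact; stop at the first email found."""
    for path in ("/", "/contact"):
        try:
            html = fetch(f"https://{domain}{path}")
        except OSError:  # urllib.error.URLError and timeouts subclass OSError
            continue
        match = EMAIL_RE.search(html)
        if match:
            return match.group(0).lower()  # early abort: skip remaining paths
    return ""
```

Because urllib requests only the URL it is given, images, CSS and JS are never downloaded. The regex is intentionally loose and will also match asset names such as `logo@2x.png`, so a real scraper would need to filter candidates before handing them to the verifier.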

Residential proxy options (if needed for the search/scrape fallback)

Cheapest well-known, pay-as-you-go preferred:

  • IPRoyal (~$3.50-7/GB, credits do not expire) -- top pick for one-off + reusable
  • Webshare (~$3-5/GB) -- cheapest sticker if running regularly
  • Decodo / ex-Smartproxy (~$3.5-7/GB) -- smoothest dashboard
  • Avoid Bright Data / Oxylabs for this (premium price; their extra reliability is not needed on easy targets like direct site fetches).

Decision gate before committing

Get a sample export (50-100 rows) from nationalemails.com and:

  1. Confirm column format.
  2. Check actual coverage of small healthcare facilities (the CLIA long tail -- solo-doc labs). If coverage is thin like the NPPES match was, the domain will not be there to extract and this approach yields little.
  3. Append the sample to CLIA, measure address/phone match rate.
  4. Run matched emails through the verifier; measure verified-deliverable rate.
  5. If the verified-deliverable rate is decent for a $99 reusable list -> proceed with the full list. If poor -> fall back to the postcard/phone channel for CLIA.
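
The gate measurements can be computed as below. This is a sketch: the plan only says "decent", so the `min_yield` default is a placeholder to be chosen before looking at the sample, not a recommended threshold. A Wilson score interval is included because a 50-100 row sample gives a noisy estimate.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def gate_metrics(sample_rows: int, matched: int, verified: int) -> dict:
    """Rates for gate steps 3-4, plus the end-to-end yield per list row."""
    return {
        "match_rate": matched / sample_rows if sample_rows else 0.0,
        "verified_rate": verified / matched if matched else 0.0,
        "end_to_end": verified / sample_rows if sample_rows else 0.0,
        "end_to_end_ci": wilson_interval(verified, sample_rows),
    }


def proceed_with_full_list(metrics: dict, min_yield: float = 0.20) -> bool:
    """Hypothetical rule: proceed only if even the low end clears the bar."""
    low, _high = metrics["end_to_end_ci"]
    return low >= min_yield
```

Usage: `gate_metrics(100, 40, 30)` describes a sample where 40 of 100 rows matched on address or phone and 30 of those verified as deliverable. Gating on the lower bound rather than the point estimate is a conservative choice for small samples.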

Already built (this session)

  • scripts/harvest_clia_renewals.py -- parse CMS CLIA file, filter to labs expiring within a window (default 120d), emit name/addr/phone/expiry. (676k scanned -> 69,791 expiring in [-30d, +120d]; on prod at data/npi_build/clia_renewals.csv.)
  • scripts/match_clia_to_nppes.py -- NPI bridge attempt (0.3% yield; kept for reference, not the path forward).
  • clia-renewal service in api/src/service-catalog.ts ($449, discountable) + order page site/src/pages/order/clia-renewal.astro + intake-manifest entry.
  • data/hc_campaigns/hc_clia_renewal.html -- warm turnover-safety-net email with the striped official-record card (CLIA #, expiry, status), verify-on-CMS-QCOR, founder guarantee card, full CAN-SPAM address.

To build (once list sample is validated)

  1. scripts/append_match.py -- join purchased list to a vertical file (CLIA/trucking/telecom), keep address/phone-matched rows, flag confidence, extract domain.
  2. scripts/scrape_domain_emails.py -- fetch known domains (home + /contact), gzip + HTML-only + early-abort, scrape current contact email; optional --proxy for residential egress.
  3. Wire output into verify_csv_emails.py -> send-ready CSV with mx_provider.
  4. Add clia_renewal segment to scripts/build_healthcare_campaigns.py SEGMENTS + cron, MX-throttled, once a verified emailable list exists.
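
For the mx_provider tag that MX throttling depends on, one possible approach is to classify the primary MX hostname by suffix. The labels and the suffix table below are assumptions for illustration; the tags the existing verifier actually emits may differ.

```python
# Suffix -> label table (illustrative, not exhaustive). The label strings
# are hypothetical and should be aligned with whatever the verifier emits.
MX_SUFFIXES = (
    ("google.com", "google"),
    ("googlemail.com", "google"),
    ("protection.outlook.com", "microsoft"),
    ("yahoodns.net", "yahoo"),
)


def mx_provider(mx_host: str) -> str:
    """Map an MX hostname to a coarse operator label for throttling."""
    host = (mx_host or "").strip().lower().rstrip(".")
    for suffix, label in MX_SUFFIXES:
        # Match whole DNS labels only, so "notgoogle.com" is not "google".
        if host == suffix or host.endswith("." + suffix):
            return label
    return "other"
```

Grouping by operator rather than by recipient domain matters because many small practices host mail with the same few providers, so per-domain limits alone would not cap the volume each operator sees.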

Postcard alternative (if email yield stays poor)

  • ~161k labs/yr, avg ~13.4k/month (spikes: Aug ~26k, Mar ~20k = CMS batch months).
  • Mail ~90 days before expiry: ~3,100 postcards/week for full coverage (~620/day, 5-day). Smooth the Aug/Mar spikes over a 60-120d pre-expiry window.
  • All-in ~$0.40-0.75/card -> ~$5.4k-10k/month full coverage. Break-even ~12-22 conversions/month at the $449 service fee (~0.09-0.17% response).
  • Drive responders to a "check my CLIA" tool on the site to capture email at the point of interest (converts the unreachable-by-email audience into warm leads).
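
The postcard figures can be recomputed from the stated inputs with a few lines. This sketch assumes the full $449 fee counts toward break-even (no discounts, no fulfilment cost), which is a simplification.

```python
LABS_PER_YEAR = 161_000  # labs expiring in the next 12 months (from the plan)
SERVICE_FEE = 449.0      # CLIA renewal service price (from the plan)


def postcard_economics(cost_per_card: float) -> dict:
    """Volume, monthly cost and break-even for full postcard coverage."""
    cards_per_month = LABS_PER_YEAR / 12
    cards_per_week = LABS_PER_YEAR / 52
    monthly_cost = cards_per_month * cost_per_card
    break_even_orders = monthly_cost / SERVICE_FEE
    return {
        "cards_per_month": cards_per_month,
        "cards_per_week": cards_per_week,
        "cards_per_day": cards_per_week / 5,  # 5 mailing days per week
        "monthly_cost": monthly_cost,
        "break_even_orders": break_even_orders,
        # Equals cost_per_card / SERVICE_FEE, so it does not depend on volume.
        "break_even_response": break_even_orders / cards_per_month,
    }


if __name__ == "__main__":
    for cost in (0.40, 0.75):
        summary = {k: round(v, 4) for k, v in postcard_economics(cost).items()}
        print(cost, summary)
```

Since the break-even response rate reduces to card cost divided by fee, mailing only a subset of expiring labs changes the spend but not the response rate needed to cover it.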