docs: CLIA / multi-vertical email enrichment plan
Capture the full decision trail and chosen approach for making CLIA labs emailable: why NPI->NPPES (0.3%) and DirectTrust failed, datacenter-IP search blocking, the $99 B2B-list -> email-domain -> scrape-current-email -> verify pipeline (durable domain even when the mailbox is stale), hard rules protecting the warming mail pool, gzip/HTML-only bandwidth optimization, residential proxy options, the sample-validation gate before committing, what's already built (harvest, service, order page, email template), and the postcard fallback.
parent 9c7a08f5c9
commit 766e32e555
1 changed files with 153 additions and 0 deletions
153
docs/clia-enrichment-plan.md
Normal file
@@ -0,0 +1,153 @@

# CLIA (and multi-vertical) Email Enrichment Plan

**Status:** planning / partially built
**Owner:** Performance West
**Last updated:** 2026-06-13

## Goal

Turn the CMS CLIA laboratory file (676k labs; ~161k expiring in the next 12
months, ~13.4k/month) into a **deliverable, emailable** audience for the CLIA
Certificate Renewal service, without harming the warming mail-sender pool, and
build a **reusable enrichment pipeline** that works for trucking and telecom too.

## Why this is needed

- The CLIA POS file has **no NPI and no email** -- only facility name, mailing
  address, phone, and the certificate expiration date (`TRMNTN_EXPRTN_DT`, the
  recurring 2-year renewal trigger).
- Direct NPI->email join failed: matching CLIA -> emailable NPPES org by
  name+zip yielded only **186 / 69,791 (0.3%)**. Email-first via NPPES is dead.
- DirectTrust / Direct Secure Messaging is a **closed clinical trust network**
  (referrals, transitions of care (TOC), fax replacement), not cold-mailable
  from a normal MTA -- verified, not viable for marketing.
- Datacenter IPs get **bot-blocked by search engines** almost immediately
  (15/15 rapid DDG queries blocked from the prod datacenter IP). Custom rDNS does
  NOT fix this (detection is ASN/rate-based, not PTR-based).

## Channels by viability (CLIA audience)

| Channel | Viable? | Why |
|---|---|---|
| Cold email via NPPES NPI match | No | 0.3% match |
| DirectTrust / Direct Secure Messaging | No | closed clinical network, AUP-restricted |
| Self-scrape search from datacenter IPs | Fragile | search engines bot-block datacenter IPs |
| Self-scrape search via residential proxy | Yes | residential exit IPs avoid bot detection |
| **B2B append list ($99, monthly-updated)** | **Test first** | gives email + DOMAIN; cheap + reusable |
| Phone | Yes | clean phone for ~all 161k |
| Direct mail / postcard | Yes | clean name+address for ~all 161k (~3,100/wk for full coverage) |

## Chosen approach: $99 B2B append list -> domain -> scrape current email -> verify

The append list's most **durable** asset is the email **domain** (practices keep
their domain for years even as staff turn over). So:

1. Buy the **$99 B2B list** (nationalemails.com, claims monthly updates;
   reusable across all verticals).
2. **Append/join** the list to the CLIA file.
3. **Confidence filter:** keep only rows where **address OR phone matches** the
   CMS CLIA record (confirms right entity, discards same-name mismatches).
4. **Extract the email domain** from each matched record (durable even if the
   specific mailbox is stale).
5. **Fetch that known domain's website** (home + /contact) and scrape the
   **current** contact email. Fetching KNOWN domains is cheap + reliable and may
   not even need the proxy (it is not search-engine scraping).
6. **Merge:** prefer the freshly-scraped current email, fall back to the list email.
7. **Verify** everything through the existing verifier (`verify_csv_emails.py` on
   the non-sending **.72** IP: MX + SMTP RCPT + catch-all detection) BEFORE
   anything touches a warming IP.
8. Output a **send-ready CSV** with `mx_provider` tags (for per-operator throttling).

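
Steps 3, 4, and 6 above can be sketched in a few pure functions. This is a minimal illustration, not the contents of `scripts/append_match.py` (which is not built yet). The column names (`phone`, `address`), the normalization rules, and every function name here are assumptions made for the example.

```python
import re
from typing import Optional


def norm_phone(raw: Optional[str]) -> str:
    """Keep digits only and return the last 10 (drops a leading US '1')."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[-10:] if len(digits) >= 10 else ""


def norm_addr(raw: Optional[str]) -> str:
    """Uppercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", (raw or "").upper())
    return " ".join(cleaned.split())


def is_confident_match(list_row: dict, clia_row: dict) -> bool:
    """Step 3: keep a row only if address OR phone agrees with the CLIA record."""
    phone_a = norm_phone(list_row.get("phone"))
    phone_b = norm_phone(clia_row.get("phone"))
    addr_a = norm_addr(list_row.get("address"))
    addr_b = norm_addr(clia_row.get("address"))
    # Empty values never count as a match.
    return bool(phone_a and phone_a == phone_b) or bool(addr_a and addr_a == addr_b)


def email_domain(email: Optional[str]) -> str:
    """Step 4: return the lowercased domain part, or '' if none is present."""
    _, sep, domain = (email or "").strip().lower().rpartition("@")
    return domain if sep and "." in domain else ""


def merge_email(scraped: Optional[str], listed: Optional[str]) -> str:
    """Step 6: prefer the freshly scraped address, fall back to the list email."""
    return (scraped or listed or "").strip().lower()
```

Exact string equality on a normalized address is deliberately strict; it will miss variants such as "STREET" vs "ST", so a real run may need abbreviation folding before the measured match rate is trusted.
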
### Pipeline diagram

```mermaid
flowchart TD
    A["$99 B2B list (name, addr, phone, email)"] --> B["Append/join to CLIA file"]
    C["CMS CLIA file (name, addr, phone, expiry)"] --> B
    B --> D{"address OR phone matches?"}
    D -->|no| X["discard (wrong entity)"]
    D -->|yes| E["extract email DOMAIN"]
    E --> F["fetch known domain site\n(home + /contact)\ngzip, HTML-only, early-abort"]
    F --> G["scrape current contact email"]
    G --> H["merge: prefer scraped, fallback list email"]
    H --> I["verify via .72\n(MX + SMTP RCPT + catch-all)"]
    I --> J["send-ready CSV + mx_provider tags"]
```

## Hard rules (non-negotiable)

- **NEVER load a purchased/scraped list directly into the warming pool.**
  Everything goes through the verifier first. We are mid reputation-recovery
  (Gmail/Outlook throttled us after the 4k/day spike) -- a bad list re-tanks it.
- **Mail-sender pool (.94-.98) stays untouched by any scraping.** Scraping
  egresses via residential proxy or the non-sending .72 IP only.
- Address/phone match = right ENTITY confidence; verifier = DELIVERABLE
  confidence. Need both before sending.
- CAN-SPAM: every commercial email carries the full postal address + unsubscribe
  (already enforced across templates).

## Bandwidth optimization (if proxy is used, billed per GB)

- Request `Accept-Encoding: gzip, deflate, br` (measured ~76% off a real clinic
  site: 68KB -> 16KB).
- HTML document only -- skip images/CSS/JS.
- Early-abort once a valid email is found (do not fetch /contact if home page had it).
- Cap max bytes per page.
- Net: ~161k run likely well under 5 GB -> ~$15-40 on a cheap residential proxy.

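
The four tactics above can be combined in a small standard-library sketch. It is an illustration under stated assumptions, not the planned `scripts/scrape_domain_emails.py`: the byte cap value, the email regex, and all function names are hypothetical. It requests only `gzip` because the Python standard library cannot decode `br`; adding Brotli would need a third-party decoder.

```python
import re
import urllib.request
import zlib
from typing import Iterable, Iterator, Optional

MAX_BYTES = 200_000  # hypothetical cap on bytes read per page
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def decode_body(raw: bytes, content_encoding: str = "") -> str:
    """Decode a (possibly truncated) response body to text."""
    if content_encoding.lower() == "gzip":
        # A streaming decompressor tolerates a body cut short by the byte cap.
        raw = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(raw)
    return raw.decode("utf-8", errors="replace")


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch only the HTML document, compressed, reading at most MAX_BYTES."""
    req = urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip", "Accept": "text/html"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        raw = resp.read(MAX_BYTES)
        return decode_body(raw, resp.headers.get("Content-Encoding", ""))


def site_pages(domain: str) -> Iterator[str]:
    """Lazily yield home, then /contact; a page is fetched only when asked for."""
    for path in ("/", "/contact"):
        try:
            yield fetch_html(f"https://{domain}{path}")
        except OSError:  # covers URLError, HTTPError, and timeouts
            continue


def first_email(html: str) -> Optional[str]:
    """Return the first address that is not an asset name like logo@2x.png."""
    for match in EMAIL_RE.finditer(html):
        candidate = match.group(0).lower()
        if not candidate.endswith(ASSET_SUFFIXES):
            return candidate
    return None


def scrape_pages(pages: Iterable[str]) -> Optional[str]:
    """Early-abort: stop consuming pages as soon as one yields an email."""
    for html in pages:
        found = first_email(html)
        if found:
            return found
    return None
```

Usage would be `scrape_pages(site_pages("exampleclinic.com"))` (a placeholder domain). Because `site_pages` is a generator, `/contact` is never requested when the home page already contains an address.
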
## Residential proxy options (if needed for the search/scrape fallback)

Cheapest well-known options, pay-as-you-go preferred:
- **IPRoyal** (~$3.50-7/GB, credits do not expire) -- top pick for one-off + reusable
- **Webshare** (~$3-5/GB) -- cheapest sticker price if running regularly
- **Decodo / ex-Smartproxy** (~$3.5-7/GB) -- smoothest dashboard
- Avoid Bright Data / Oxylabs for this (premium pricing; their extra reliability
  is not needed on easy targets like direct site fetches).

## Decision gate before committing

Get a **sample export (50-100 rows)** from nationalemails.com and:
1. Confirm column format.
2. Check actual **coverage of small healthcare facilities** (the CLIA long tail:
   solo-doc labs). If coverage is thin, as the NPPES match was, the domain will
   not be there to extract and this approach yields little.
3. Append the sample to CLIA, measure address/phone match rate.
4. Run matched emails through the verifier; measure verified-deliverable rate.
5. If the verified-deliverable rate is decent for a $99 reusable list -> proceed
   with the full run. If poor -> fall back to the postcard/phone channel for CLIA.

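
The gate in steps 3-5 comes down to two ratios. The sketch below shows one way to compute them and reach a go/no-go decision. The plan only says the rate must be "decent", so both thresholds are placeholder assumptions to be replaced once a sample exists; the class and function names are also hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    sample_rows: int    # rows in the vendor sample
    matched_rows: int   # rows whose address OR phone matched a CLIA record
    verified_rows: int  # matched rows the verifier marked deliverable

    @property
    def match_rate(self) -> float:
        return self.matched_rows / self.sample_rows if self.sample_rows else 0.0

    @property
    def verified_rate(self) -> float:
        return self.verified_rows / self.matched_rows if self.matched_rows else 0.0

    @property
    def end_to_end_rate(self) -> float:
        """Share of sample rows that end up send-ready."""
        return self.match_rate * self.verified_rate


def decide(result: SampleResult,
           min_match_rate: float = 0.30,      # placeholder, not from the plan
           min_verified_rate: float = 0.50    # placeholder, not from the plan
           ) -> str:
    """Return 'proceed' or 'fallback' (postcard/phone) for the CLIA vertical."""
    if (result.match_rate >= min_match_rate
            and result.verified_rate >= min_verified_rate):
        return "proceed"
    return "fallback"
```

For scale, the failed NPPES bridge corresponds to `SampleResult(69_791, 186, 0).match_rate`, which is about 0.3%, far below any plausible threshold.
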
## Already built (this session)

- `scripts/harvest_clia_renewals.py` -- parse CMS CLIA file, filter to labs
  expiring within a window (default 120d), emit name/addr/phone/expiry.
  (676k scanned -> 69,791 expiring in [-30d, +120d]; on prod at
  `data/npi_build/clia_renewals.csv`.)
- `scripts/match_clia_to_nppes.py` -- NPI bridge attempt (0.3% yield; kept for
  reference, not the path forward).
- `clia-renewal` service in `api/src/service-catalog.ts` ($449, discountable) +
  order page `site/src/pages/order/clia-renewal.astro` + intake-manifest entry.
- `data/hc_campaigns/hc_clia_renewal.html` -- warm turnover-safety-net email with
  the striped official-record card (CLIA #, expiry, status), verify-on-CMS-QCOR,
  founder guarantee card, full CAN-SPAM address.

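
The expiry window used by the harvest step can be sketched as a single predicate. This is not the code in `scripts/harvest_clia_renewals.py`. The function name and parameters are hypothetical, and the `YYYYMMDD` date format for `TRMNTN_EXPRTN_DT` is an assumption that should be checked against the actual CMS file.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def in_renewal_window(expiry_raw: Optional[str],
                      today: date,
                      back_days: int = 30,
                      ahead_days: int = 120,
                      fmt: str = "%Y%m%d") -> bool:
    """True if the certificate expiry falls in [today - back, today + ahead]."""
    try:
        expiry = datetime.strptime((expiry_raw or "").strip(), fmt).date()
    except ValueError:
        # Blank or malformed dates are excluded rather than guessed at.
        return False
    earliest = today - timedelta(days=back_days)
    latest = today + timedelta(days=ahead_days)
    return earliest <= expiry <= latest
```

Both ends of the window are inclusive, so a lab that expired exactly 30 days ago or expires exactly 120 days out is kept.
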
## To build (once list sample is validated)

1. `scripts/append_match.py` -- join purchased list to a vertical file
   (CLIA/trucking/telecom), keep address/phone-matched rows, flag confidence,
   extract domain.
2. `scripts/scrape_domain_emails.py` -- fetch known domains (home + /contact),
   gzip + HTML-only + early-abort, scrape current contact email; optional
   `--proxy` for residential egress.
3. Wire output into `verify_csv_emails.py` -> send-ready CSV with `mx_provider`.
4. Add `clia_renewal` segment to `scripts/build_healthcare_campaigns.py` SEGMENTS
   + cron, MX-throttled, once a verified emailable list exists.

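
To illustrate what an `mx_provider` tag could look like for per-operator throttling, here is a minimal sketch that classifies a domain by its MX hostnames. How `verify_csv_emails.py` actually derives the tag is not described in this plan, so the suffix table, the tag values, and the function names below are assumptions. The MX lookup itself is left out; the functions take hostnames that have already been resolved.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from MX hostname suffix to a throttle bucket.
PROVIDER_SUFFIXES: Dict[str, Tuple[str, ...]] = {
    "google": ("google.com", "googlemail.com"),
    "microsoft": ("outlook.com",),  # e.g. *.mail.protection.outlook.com
}


def mx_provider(mx_hosts: Iterable[str]) -> str:
    """Return the first provider whose suffix matches any MX host, else 'other'."""
    for host in mx_hosts:
        name = host.rstrip(".").lower()
        for provider, suffixes in PROVIDER_SUFFIXES.items():
            if any(name == s or name.endswith("." + s) for s in suffixes):
                return provider
    return "other"


def provider_counts(rows: Iterable[dict]) -> Counter:
    """Count send-ready rows per provider, to size each throttle bucket."""
    return Counter(row["mx_provider"] for row in rows)
```

Matching on a dot-delimited suffix rather than a substring avoids tagging an unrelated host such as `mail.notgoogle.com` as Google.
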
## Postcard alternative (if email yield stays poor)

- ~161k labs/yr, avg ~13.4k/month (spikes: Aug ~26k, Mar ~20k = CMS batch months).
- Mail ~90 days before expiry: **~3,100 postcards/week** for full coverage
  (~620/day, 5-day). Smooth the Aug/Mar spikes over a 60-120d pre-expiry window.
- All-in ~$0.40-0.75/card -> ~$5.4k-10k/month full coverage. Break-even ~12-22
  conversions/month at the $449 service fee (~0.09-0.17% response).
- Drive responders to a "check my CLIA" tool on the site to capture email at the
  point of interest (converts the unreachable-by-email audience into warm leads).

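
The postcard figures above follow from simple arithmetic on the numbers in this section. The sketch below reproduces it so the inputs can be changed; the function and field names are hypothetical. It assumes full coverage at the average monthly volume and ignores the Aug/Mar spikes.

```python
from dataclasses import dataclass

LABS_PER_YEAR = 161_000
SERVICE_FEE = 449.0  # USD, the clia-renewal service price


@dataclass
class PostcardEconomics:
    cards_per_week: float
    cards_per_day: float        # 5 mailing days per week
    monthly_cost: float         # USD at full coverage
    breakeven_conversions: float
    breakeven_response_rate: float


def postcard_economics(cost_per_card: float) -> PostcardEconomics:
    cards_per_month = LABS_PER_YEAR / 12
    cards_per_week = LABS_PER_YEAR / 52
    monthly_cost = cards_per_month * cost_per_card
    breakeven = monthly_cost / SERVICE_FEE
    return PostcardEconomics(
        cards_per_week=cards_per_week,
        cards_per_day=cards_per_week / 5,
        monthly_cost=monthly_cost,
        breakeven_conversions=breakeven,
        # Equivalent to cost_per_card / SERVICE_FEE.
        breakeven_response_rate=breakeven / cards_per_month,
    )


low, high = postcard_economics(0.40), postcard_economics(0.75)
```

The break-even response rate reduces to card cost divided by service fee, so it does not depend on volume. Only the absolute monthly spend does.
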