# Vertical Lead-Source Analysis: Ranked by Email Reliability
**Date:** 2026-06-13

**Purpose:** The proven bottleneck for every cold-email vertical is NOT the
deficiency signal or the audience size. It is whether a reliable, public, bulk
source gives us a **deliverable email** (or a clean, high-yield path to one).
This doc ranks candidate verticals by that single criterion, using what we
verified this session: FCC and FMCSA work; the CLIA email match came in at
0.3%, i.e. dead.
## The rule (learned the hard way)
A vertical is **email-viable** only if ONE of these is true:
1. The public registry **contains the email** (FCC RMD `contact_email`, FMCSA
carrier `email_address`). -> Tier 1, just send.
2. The registry maps to a **second free public source that has email** by a clean
key (NPI, FRN, CIK, domain). -> Tier 2, one enrichment hop.
3. The targets reliably have **websites** so a domain->scrape gets email at
decent yield. -> Tier 3, scrape pipeline (proxy).

Otherwise it is Tier 4: **phone / direct-mail only** (CLIA, EPA RCRA, raw NPPES
individuals). Still real money, just not cold email.
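
The rule above is a first-match decision, so it can be sketched as a small classifier. This is a minimal illustration only: the `Vertical` fields, the `Tier` enum, and `classify` are hypothetical names, not part of any existing pipeline, and the boolean inputs still have to be established by hand (or by a pilot) for each vertical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    NATIVE_EMAIL = 1     # the registry itself contains the email
    ONE_HOP = 2          # clean key to a second free public source with email
    DOMAIN_SCRAPE = 3    # targets reliably have websites to scrape
    PHONE_MAIL_ONLY = 4  # no viable email path


@dataclass(frozen=True)
class Vertical:
    # Hypothetical record; each flag mirrors one condition of the rule.
    name: str
    registry_has_email: bool = False
    has_keyed_email_source: bool = False
    targets_have_websites: bool = False


def classify(v: Vertical) -> Tier:
    # Cheapest path first; the first condition that holds decides the tier.
    if v.registry_has_email:
        return Tier.NATIVE_EMAIL
    if v.has_keyed_email_source:
        return Tier.ONE_HOP
    if v.targets_have_websites:
        return Tier.DOMAIN_SCRAPE
    return Tier.PHONE_MAIL_ONLY
```

For example, `classify(Vertical("FMCSA trucking", registry_has_email=True))` returns `Tier.NATIVE_EMAIL`, while a vertical with all three flags false lands in `Tier.PHONE_MAIL_ONLY`.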
## Tier 1 -- Email is IN the registry (send today)
| Vertical | Source | Email field | Recurring obligation | Status |
|---|---|---|---|---|
| **FCC carriers / VoIP / ISP** | FCC RMD, 499 filer, CORES | `contact_email` (native) | RMD annual, 499-A/Q, CPNI annual | LIVE (built) |
| **FMCSA trucking** | FMCSA carrier census | `email_address` (native) | MCS-150 biennial, IFTA quarterly, UCR annual | LIVE (built) |

These two are the whole reason the business works. Nothing else is as clean.
## Tier 2 -- One free public hop to email (worth building)
| Vertical | Registry (no email) | Email source + key | Yield estimate | Notes |
|---|---|---|---|---|
| **Healthcare providers (org NPIs)** | NPPES | NPPES **endpoint_pfile** (Direct/email endpoints), keyed by NPI | ~88k institutional emails harvested, ~63k verified | ALREADY HARVESTED. The org/institutional slice has real emails (we filtered HISP/Direct gateways). Individual NPIs do NOT. Recurring: revalidation, NPPES update, OIG screening. |
| **Public companies (OTC/SEC filers)** | SEC EDGAR (CIK, state of incorp, phone, addr, **website**) | website domain -> scrape IR/contact email; or email-append | Medium-high (real cos w/ IR pages) | ~2,771 SEC-reporting OTC issuers; Delaware/Nevada heavy. Hook: reincorporate-to-TX, annual report, RA, franchise tax. Small but high-ticket. |
## Tier 3 -- Domain-scrape required (proxy pipeline; medium yield)
| Vertical | Registry | Why scrape | Yield |
|---|---|---|---|
| **FMC Ocean Transportation Intermediaries (NVOCC/forwarders)** | FMC OTI lookup | few thousand licensees, most have websites | medium-high; small universe but real businesses + bonds renew |
| **State business entities (formation/RA/foreign-qual)** | State SOS bulk (FL/CA/VA/TX free; Socrata) | millions of entities, name+addr+officers, often a website | low-medium per scrape, but HUGE universe; better to target by trigger (newly-formed, delinquent, foreign-qual) |
## Tier 4 -- Phone / direct-mail only (NOT cold email)
| Vertical | Registry | Why not email | Best channel |
|---|---|---|---|
| **CLIA labs** | CMS POS CLIA file | no NPI, no email; NPPES name+zip match = **0.3%** (verified dead) | postcard (~3,100/wk full coverage), phone |
| **EPA RCRA hazardous-waste handlers** | ECHO bulk | no email anywhere in ECHO | phone (RCRAInfo), mail, append |
| **NPPES individual providers** | NPPES | individuals have phone/fax, rarely a usable org email | phone, fax, web inbound |
## Net recommendation (where to invest next, in order)
1. **Mine the healthcare ORG emails we already harvested harder** (Tier 2, zero
new cost). 63k verified institutional emails -> diversify triggers beyond NPI
revalidation: NPPES staleness, OIG/SAM screening, org-NPI corrections. The
data is already on prod.
2. **SEC/OTC corporate** (Tier 2). Small universe (~2.8k) but high-ticket
(reincorporation, RA, franchise tax, foreign-qual) and a timely TX hook.
EDGAR is free + bulk-OK; emails via website-domain scrape (we have the
pipeline design from CLIA). Worth a pilot because the per-deal value is high.
3. **State business entities by TRIGGER** (Tier 3, biggest universe). Do NOT
blast all entities; target newly-formed (need RA/EIN/OA),
delinquent/admin-dissolved (reinstatement), or foreign-qualification candidates.
Free bulk from FL/CA/VA; email via domain-scrape. This is the largest TAM if
the scrape yields.
4. **FMC OTI** (Tier 3, small but clean): few thousand, website-rich, bonds renew
annually. Quick win if we want another trucking-adjacent vein.
5. **CLIA / EPA RCRA: keep as phone/postcard**, not email. Service + LP exist for
CLIA; drive via mail to a "check your expiration" web tool that captures email.
## The honest meta-point
We have spent effort proving that **most government registries are email-poor.**
The reliable email money is: FCC + FMCSA (native), plus the **healthcare org
emails we already harvested**. Everything else is either a scrape gamble or a
phone/mail channel. Before building any new vertical, confirm its email path
falls in Tier 1-2; if it is Tier 3, pilot the scrape yield FIRST (as we should
have for CLIA); if it is Tier 4, don't pretend it is an email channel.
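
One way to make "pilot the scrape yield first" concrete is to scrape a small random sample and require that a conservative lower bound on the yield clears a go/no-go threshold. The sketch below uses the Wilson score interval for that; the function names and the threshold values are assumptions for illustration, not numbers we have agreed on.

```python
import math


def wilson_lower_bound(hits: int, sample_size: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a yield of hits/sample_size.

    z = 1.96 corresponds to a ~95% two-sided interval.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= hits <= sample_size:
        raise ValueError("hits must be between 0 and sample_size")
    p = hits / sample_size
    z2 = z * z
    centre = p + z2 / (2 * sample_size)
    margin = z * math.sqrt(p * (1 - p) / sample_size + z2 / (4 * sample_size ** 2))
    return (centre - margin) / (1 + z2 / sample_size)


def pilot_passes(hits: int, sample_size: int, min_yield: float) -> bool:
    # min_yield is a hypothetical go/no-go threshold chosen per vertical.
    return wilson_lower_bound(hits, sample_size) >= min_yield
```

A pilot that mirrors the CLIA result (3 usable emails in a sample of 1,000) fails any realistic threshold, e.g. `pilot_passes(3, 1000, min_yield=0.05)` is `False`, whereas 300 hits in 1,000 clears a 25% bar.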
## Update 2026-06-13: healthcare org-email diversification (ACTED ON #1)
Unlocked the full verified institutional pool for broad offers:
- Root cause found: OIG/NPPES segments were gated by a warmup selector that
excluded `not_on_list` rows. That filter was meant as a deliverability proxy,
but it dropped ~62k of the 63k, because org NPIs are not individual Medicare
enrollees. Since we already SMTP-verified every inbox, we added an
`institutional_verified` selector that trusts our own verification. OIG
screening + NPPES update now address **62,422** (was ~1,437).
- `enrich_institutional_revalidation.py` joins the institutional list to the CMS
Revalidation Due Date List (revalidation_base.csv) by NPI -> ~1,437 genuine
Medicare enrollees (197 overdue / 164 due-soon) for the flagship $599 reval pitch.
- `pw-hc-nppes` cron now runs oig_screening + nppes_outdated + revalidation_overdue
+ revalidation_due_soon against the enriched file (still warmup-capped +
MX-throttled; bigger supply, same safe send rate). npi_reactivation stays on
the accurate leie_or_deactivated selector (no false "deactivated" claims).
- `pw-hc-refresh` cron now re-downloads + re-joins the reval base so overdue
figures stay accurate.
- MAINTENANCE: the CMS bulk file URLs (revalidation_base.csv, CLIA) embed a
dated path that rotates ~monthly. If the download 404s, re-fetch the dataset's
current downloadURL from https://data.cms.gov/data.json. Consider switching to
the dataset's stable data-api endpoint.
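
The re-fetch step in the MAINTENANCE note can be sketched as below. This assumes the catalog at https://data.cms.gov/data.json follows the usual DCAT-US layout (a top-level `dataset` list whose entries carry `title` and a `distribution` list with `downloadURL`); verify that against the live file before relying on it. `fetch_catalog` and `find_download_urls` are hypothetical helper names, and the title substring to match is also an assumption.

```python
import json
import urllib.request

CATALOG_URL = "https://data.cms.gov/data.json"


def fetch_catalog(url: str = CATALOG_URL, timeout: float = 30.0) -> dict:
    # Download and parse the catalog; raises on HTTP or JSON errors.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def find_download_urls(catalog: dict, title_substring: str) -> list[str]:
    """Return the downloadURLs of every dataset whose title contains the substring.

    Matching is case-insensitive; duplicates are dropped, order is preserved.
    """
    needle = title_substring.lower()
    urls: list[str] = []
    for dataset in catalog.get("dataset", []):
        if needle not in dataset.get("title", "").lower():
            continue
        for dist in dataset.get("distribution", []):
            url = dist.get("downloadURL")
            if url and url not in urls:
                urls.append(url)
    return urls
```

A refresh job could call `find_download_urls(fetch_catalog(), "Revalidation Due Date List")` when the stored URL returns 404, and fail loudly if the result is empty rather than guess a path.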
## Update 2026-06-14: trucking/main pool per-MX throttling (deliverability fix)
Root cause of the persistent main-pool 54% delivery rate and the Gmail/Outlook
block storm (Jun 13-14), now PROVEN by MX-tagging the carrier pool:
- **702,214 carriers on Google + 135,129 on Microsoft** -- the warmup was
hammering exactly the two operators blocking us (no per-MX throttle on trucking,
only HC had it).
- Fix: migration 097 (mx_provider) + mx_tag_carriers.py (concurrent MX resolve,
bulk temp-table-join write; 1.24M of 1.49M carriers tagged).
build_trucking_campaigns now EXCLUDES Google/Microsoft/Proofpoint/etc. until
warmup day 30 (reputation recovery), with per-MX caps thereafter. Untagged
carriers pass (most are now tagged).
- Effect on the MCS-150 overdue pool: 496,743 sendable -> 230,135 after excluding
263,515 Google/MS carriers (the remaining ~3.1k gap is presumably the other
excluded providers, e.g. Proofpoint). Plenty of long-tail volume
(yahoo/comcast/charter/centurylink/windstream/earthlink/...) to warm on safely
while reputation recovers.
- MAINTENANCE: re-run mx_tag_carriers.py periodically (or add it to the trucking
cron precursor) to tag newly-added carriers; adjust MAIN_BIG_MX_EXCLUDE_UNTIL_DAY
or set MAIN_SKIP_BIG_MX=0 once Postmaster Tools shows recovered reputation.
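
The tagging and exclusion logic described above can be sketched roughly as follows. This is not the code in mx_tag_carriers.py or build_trucking_campaigns: the suffix table, function names, and the exact boundary (excluded while `warmup_day < exclude_until_day`) are assumptions, and per-MX caps after day 30 are left out.

```python
from typing import Optional

# MX hostname suffix -> provider label. Illustrative, not exhaustive.
MX_SUFFIXES = {
    "google.com": "google",
    "googlemail.com": "google",
    "mail.protection.outlook.com": "microsoft",
    "pphosted.com": "proofpoint",
}
BIG_MX = frozenset({"google", "microsoft", "proofpoint"})


def classify_mx(mx_host: str) -> str:
    # Normalise: trim whitespace, drop the trailing root dot, lowercase.
    host = mx_host.strip().rstrip(".").lower()
    for suffix, provider in MX_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return provider
    return "other"


def is_sendable(provider: Optional[str], warmup_day: int,
                exclude_until_day: int = 30, skip_big_mx: bool = True) -> bool:
    # exclude_until_day / skip_big_mx stand in for MAIN_BIG_MX_EXCLUDE_UNTIL_DAY
    # and MAIN_SKIP_BIG_MX; their real semantics may differ.
    if provider is None:   # untagged carriers pass
        return True
    if not skip_big_mx:    # exclusion switched off
        return True
    return not (provider in BIG_MX and warmup_day < exclude_until_day)
```

With these defaults, a carrier whose MX is `aspmx.l.google.com` is held back on warmup day 5 and becomes eligible from day 30, while a `mx1.comcast.net` carrier is eligible throughout.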