new-site/docs/vertical-lead-source-analysis.md

Vertical Lead-Source Analysis: Ranked by Email Reliability

Date: 2026-06-13

Purpose: The proven bottleneck for every cold-email vertical is NOT the deficiency signal or the audience size -- it is whether a reliable, public, bulk source gives us a deliverable email (or a clean, high-yield path to one). This document ranks candidate verticals by that single criterion, using what we verified this session (FCC and FMCSA work; the CLIA email match came back at 0.3%, i.e. dead).

The rule (learned the hard way)

A vertical is email-viable only if ONE of these is true:

  1. The public registry contains the email (FCC RMD contact_email, FMCSA carrier email_address). -> Tier 1, just send.
  2. The registry maps to a second free public source that has email by a clean key (NPI, FRN, CIK, domain). -> Tier 2, one enrichment hop.
  3. The targets reliably have websites so a domain->scrape gets email at decent yield. -> Tier 3, scrape pipeline (proxy).

Otherwise it is phone / direct-mail only (CLIA, EPA RCRA, raw NPPES individuals). Still real money, just not cold email.
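
The rule above is a first-match decision, which can be sketched as follows. This is a minimal illustration, not code from the pipeline: the `Vertical` class, its field names, and `email_tier` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertical:
    """Hypothetical record of what we know about a candidate vertical."""
    name: str
    registry_has_email: bool = False      # rule 1: email is a native registry field
    has_keyed_email_source: bool = False  # rule 2: second free source joins on NPI/FRN/CIK/domain
    targets_have_websites: bool = False   # rule 3: domain->scrape is plausible


def email_tier(v: Vertical) -> int:
    """Return 1-3 for email-viable verticals, 4 for phone / direct-mail only."""
    if v.registry_has_email:
        return 1
    if v.has_keyed_email_source:
        return 2
    if v.targets_have_websites:
        return 3
    return 4


fmcsa = Vertical("FMCSA trucking", registry_has_email=True)
clia = Vertical("CLIA labs")
print(fmcsa.name, email_tier(fmcsa))
print(clia.name, email_tier(clia))
```

The checks run in tier order, so a vertical that satisfies more than one rule lands in the cheapest tier it qualifies for.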

Tier 1 -- Email is IN the registry (send today)

| Vertical | Source | Email field | Recurring obligation | Status |
|---|---|---|---|---|
| FCC carriers / VoIP / ISP | FCC RMD, 499 filer, CORES | contact_email (native) | RMD annual, 499-A/Q, CPNI annual | LIVE (built) |
| FMCSA trucking | FMCSA carrier census | email_address (native) | MCS-150 biennial, IFTA quarterly, UCR annual | LIVE (built) |

These are the whole reason the business works. Nothing else is as clean.

Tier 2 -- One free public hop to email (worth building)

| Vertical | Registry (no email) | Email source + key | Yield estimate | Notes |
|---|---|---|---|---|
| Healthcare providers (org NPIs) | NPPES | NPPES endpoint_pfile (Direct/email endpoints), keyed by NPI | ~88k institutional emails harvested, ~63k verified | ALREADY HARVESTED. The org/institutional slice has real emails (we filtered HISP/Direct gateways). Individual NPIs do NOT. Recurring: revalidation, NPPES update, OIG screening. |
| Public companies (OTC/SEC filers) | SEC EDGAR (CIK, state of incorp, phone, addr, website) | website domain -> scrape IR/contact email; or email-append | Medium-high (real cos w/ IR pages) | ~2,771 SEC-reporting OTC issuers; Delaware/Nevada heavy. Hook: reincorporate-to-TX, annual report, RA, franchise tax. Small but high-ticket. |

Tier 3 -- Domain-scrape required (proxy pipeline; medium yield)

| Vertical | Registry | Why scrape | Yield |
|---|---|---|---|
| FMC Ocean Transportation Intermediaries (NVOCC/forwarders) | FMC OTI lookup | few thousand licensees, most have websites | medium-high; small universe but real businesses + bonds renew |
| State business entities (formation/RA/foreign-qual) | State SOS bulk (FL/CA/VA/TX free; Socrata) | millions of entities, name+addr+officers, often a website | low-medium per scrape, but HUGE universe; better to target by trigger (newly-formed, delinquent, foreign-qual) |

Tier 4 -- Phone / direct-mail only (NOT cold email)

| Vertical | Registry | Why not email | Best channel |
|---|---|---|---|
| CLIA labs | CMS POS CLIA file | no NPI, no email; NPPES name+zip match = 0.3% (verified dead) | postcard (~3,100/wk full coverage), phone |
| EPA RCRA hazardous-waste handlers | ECHO bulk | no email anywhere in ECHO | phone (RCRAInfo), mail, append |
| NPPES individual providers | NPPES | individuals have phone/fax, rarely a usable org email | phone, fax, web inbound |

Net recommendation (where to invest next, in order)

  1. Mine the healthcare ORG emails we already harvested harder (Tier 2, zero new cost). 63k verified institutional emails -> diversify triggers beyond NPI revalidation: NPPES staleness, OIG/SAM screening, org-NPI corrections. The data is already on prod.
  2. SEC/OTC corporate (Tier 2). Small universe (~2.7k) but high-ticket (reincorporation, RA, franchise tax, foreign-qual) and a timely TX hook. EDGAR is free + bulk-OK; emails via website-domain scrape (we have the pipeline design from CLIA). Worth a pilot because the per-deal value is high.
  3. State business entities by TRIGGER (Tier 3, biggest universe). Do NOT blast all entities; target newly-formed (need RA/EIN/OA), delinquent/admin-dissolved (reinstatement), or foreign-qualification candidates. Free bulk from FL/CA/VA; email via domain-scrape. This is the largest TAM if the scrape yields.
  4. FMC OTI (Tier 3, small but clean): few thousand, website-rich, bonds renew annually. Quick win if we want another trucking-adjacent vein.
  5. CLIA / EPA RCRA: keep as phone/postcard, not email. Service + LP exist for CLIA; drive via mail to a "check your expiration" web tool that captures email.

The honest meta-point

We have spent effort proving that most government registries are email-poor. The reliable email money is: FCC + FMCSA (native), plus the healthcare org emails we already harvested. Everything else is either a scrape gamble or a phone/mail channel. Before building any new vertical, confirm its email path falls in Tier 1-2; if it is Tier 3, pilot the scrape yield FIRST (like we should have for CLIA); if Tier 4, don't pretend it is an email channel.

Update 2026-06-13: healthcare org-email diversification (ACTED ON #1)

Unlocked the full verified institutional pool for broad offers:

  • Root cause found: OIG/NPPES segments were gated by a warmup selector that excluded not_on_list rows (a deliverability proxy that excluded ~62k of the 63k -- org NPIs are not individual Medicare enrollees). Since we already SMTP-verified every inbox, added institutional_verified selector that trusts our verification. OIG screening + NPPES update now address 62,422 (was ~1,437).
  • enrich_institutional_revalidation.py joins the institutional list to the CMS Revalidation Due Date List (revalidation_base.csv) by NPI -> ~1,437 genuine Medicare enrollees (197 overdue / 164 due-soon) for the flagship $599 reval pitch.
  • pw-hc-nppes cron now runs oig_screening + nppes_outdated + revalidation_overdue + revalidation_due_soon against the enriched file (still warmup-capped + MX-throttled; bigger supply, same safe send rate). npi_reactivation stays on the accurate leie_or_deactivated selector (no false "deactivated" claims).
  • pw-hc-refresh cron now re-downloads + re-joins the reval base so overdue figures stay accurate.
  • MAINTENANCE: the CMS bulk file URLs (revalidation_base.csv, CLIA) embed a dated path that rotates ~monthly. If the download 404s, re-fetch the dataset's current downloadURL from https://data.cms.gov/data.json. Consider switching to the dataset's stable data-api endpoint.
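
The MAINTENANCE step above (re-resolving a rotated download URL from the catalog) could look roughly like this. It is a sketch under assumptions: it relies on data.json following the DCAT-US layout (a top-level `dataset` list whose entries carry `title` and a `distribution` list with `downloadURL` and `mediaType`), and the function names, the title fragment, and the sample catalog entry are hypothetical. Check the real catalog for the exact dataset title before relying on it.

```python
import json
import urllib.request
from typing import Optional

CATALOG_URL = "https://data.cms.gov/data.json"


def load_catalog(url: str = CATALOG_URL) -> dict:
    """Download and parse the catalog (network call; not exercised below)."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def find_download_url(catalog: dict, title_fragment: str,
                      media_type: str = "text/csv") -> Optional[str]:
    """Return the first downloadURL of the requested media type for a dataset
    whose title contains title_fragment (case-insensitive), or None."""
    needle = title_fragment.lower()
    for dataset in catalog.get("dataset", []):
        if needle not in dataset.get("title", "").lower():
            continue
        for dist in dataset.get("distribution", []):
            url = dist.get("downloadURL")
            if url and dist.get("mediaType") == media_type:
                return url
    return None


# Hypothetical catalog fragment, for illustration only.
sample = {"dataset": [{
    "title": "Revalidation Due Date List",
    "distribution": [
        {"mediaType": "application/json", "accessURL": "https://example.invalid/api"},
        {"mediaType": "text/csv", "downloadURL": "https://example.invalid/2026-06/reval.csv"},
    ],
}]}
print(find_download_url(sample, "revalidation due date"))
```

In a cron job this would run only after the stored URL returns a 404, so the normal path stays a single download.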

Update 2026-06-14: trucking/main pool per-MX throttling (deliverability fix)

Root cause of the persistent main-pool 54% delivery rate and the Gmail/Outlook block storm (Jun 13-14), now PROVEN by MX-tagging the carrier pool:

  • 702,214 carriers on Google + 135,129 on Microsoft -- the warmup was hammering exactly the two operators blocking us (no per-MX throttle on trucking, only HC had it).
  • Fix: migration 097 (mx_provider) + mx_tag_carriers.py (concurrent MX resolve, bulk temp-table-join write -- 1.24M/1.49M carriers tagged). build_trucking_campaigns now EXCLUDES Google/Microsoft/Proofpoint/etc. until warmup day 30 (reputation recovery), per-MX caps thereafter. Untagged carriers pass (most are now tagged).
  • Effect on the MCS-150 overdue pool: 496,743 sendable -> 230,135 after excluding 263,515 Google/MS carriers. Plenty of long-tail volume (yahoo/comcast/charter/centurylink/windstream/earthlink/...) to warm on safely while reputation recovers.
  • MAINTENANCE: re-run mx_tag_carriers.py periodically (or add to the trucking cron precursor) to tag newly-added carriers; flip MAIN_BIG_MX_EXCLUDE_UNTIL_DAY or MAIN_SKIP_BIG_MX=0 once Postmaster Tools shows recovered reputation.
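
The tagging and exclusion logic described above can be sketched as below. This is not the code in mx_tag_carriers.py or build_trucking_campaigns: the function names, the suffix table, and the reading of "until warmup day 30" as "excluded while day < 30" are all assumptions made for the example. The suffixes themselves reflect the MX hostnames these operators commonly publish (for example aspmx.l.google.com, *.mail.protection.outlook.com, *.pphosted.com).

```python
from typing import Optional

# Assumed suffix table; a production list would likely be longer.
MX_SUFFIXES = {
    "google": (".google.com", ".googlemail.com"),
    "microsoft": (".outlook.com",),
    "proofpoint": (".pphosted.com",),
}
BIG_MX = frozenset(MX_SUFFIXES)
EXCLUDE_UNTIL_DAY = 30  # assumed stand-in for MAIN_BIG_MX_EXCLUDE_UNTIL_DAY


def mx_provider(mx_host: Optional[str]) -> Optional[str]:
    """Map a resolved MX hostname to a provider tag; None means untagged."""
    if not mx_host:
        return None
    host = mx_host.rstrip(".").lower()
    for provider, suffixes in MX_SUFFIXES.items():
        if host.endswith(suffixes):  # str.endswith accepts a tuple of suffixes
            return provider
    return "other"


def sendable(provider: Optional[str], warmup_day: int) -> bool:
    """Untagged and long-tail carriers pass; big operators wait out the warmup."""
    if provider is None:
        return True
    return not (provider in BIG_MX and warmup_day < EXCLUDE_UNTIL_DAY)


for host in ("aspmx.l.google.com.", "mx.example.net", None):
    tag = mx_provider(host)
    print(host, tag, sendable(tag, warmup_day=12))
```

Per-MX caps after the exclusion window are left out here; they would sit on top of `sendable` as a per-provider daily counter.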