new-site/docs/entity-cache-sources.md
justin f8cd37ac8c Initial commit — Performance West telecom compliance platform
Includes: API (Express/TypeScript), Astro site, Python workers,
document generators, FCC compliance tools, Canada CRTC formation,
Ansible infrastructure, and deployment scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 06:54:22 -05:00

Entity Cache Data Sources

Bulk business entity data for the corporation status check feature. Updated: 2026-04-20

Working Socrata SODA API States (free, JSON, unlimited)

| State | Dataset ID | Records | Status Field | Formation State Field | Notes |
|-------|------------|---------|--------------|-----------------------|-------|
| CO | 4ykn-tg5h | ~3M | entitystatus | jurisdictonofformation | Fully loaded |
| IA | ykb6-ywnd | ~500K | entity_status | home_state | Working |
| CT | n7gp-d28j | ~1.2M | status | state_of_formation | Working |
| OR | tckn-sxa6 | ~800K | status | state_of_origin | Active businesses only |
| NY | n9v6-gdp6 | ~2M | N/A (active only) | jurisdiction | No status field; all records are active |

API pattern: https://data.{state}.gov/resource/{id}.json?$limit=50000&$offset=0&$order=:id
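The pattern above can be paged with `$offset` until a short page comes back. The following is a minimal sketch, not the project's actual downloader: the function names are hypothetical, and the portal host is passed in full (for example `data.colorado.gov`) on the assumption that `{state}` is not always the two-letter code.

```python
import json
from typing import Callable, Iterator, List
from urllib.parse import urlencode
from urllib.request import urlopen

PAGE_SIZE = 50000  # same $limit as the documented pattern


def page_url(host: str, dataset_id: str, offset: int, limit: int = PAGE_SIZE) -> str:
    """Build one page URL following the documented pattern."""
    # safe="$:" keeps the SODA parameter names and ":id" unescaped.
    query = urlencode(
        {"$limit": limit, "$offset": offset, "$order": ":id"},
        safe="$:",
    )
    return f"https://{host}/resource/{dataset_id}.json?{query}"


def http_get_json(url: str) -> List[dict]:
    """Default fetcher: plain HTTPS GET, body decoded as a JSON array."""
    with urlopen(url, timeout=60) as resp:
        return json.load(resp)


def iter_records(
    host: str,
    dataset_id: str,
    fetch: Callable[[str], List[dict]] = http_get_json,
    limit: int = PAGE_SIZE,
) -> Iterator[dict]:
    """Yield every record, advancing $offset until a page is not full."""
    offset = 0
    while True:
        page = fetch(page_url(host, dataset_id, offset, limit))
        yield from page
        if len(page) < limit:
            return
        offset += limit
```

Ordering by `:id` keeps the pages stable between requests, which is why the pattern includes `$order`. The `fetch` parameter exists only so the paging logic can be exercised without network access.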

Broken Socrata URLs (portals reorganized, need new IDs)

| State | Old ID | Notes |
|-------|--------|-------|
| WA | 7naq-cqm3 | 404. data.wa.gov catalog empty for business category |
| IL | vqps-xatp | 404. IL SOS prohibits bulk scraping officially |
| PA | 6ftj-q3fu | 404. PA has xvd7-5r2c but no status field |
| MI | uc6u-xab8 | 404. LARA portal, no confirmed free download |
| AK | p2kg-xwxr | DNS failure. data.alaska.gov may be deprecated |
| VT | c7cm-s92n | 404. VT open data portal reorganized |

Free Bulk Download (non-Socrata)

| State | Source | Format | Cost | Fields | Status |
|-------|--------|--------|------|--------|--------|
| FL | Sunbiz FTP | Fixed-width ASCII | Free (register for FTP creds) | Name, status (A/I), filing type, date, EIN, address, RA, officers | Has status |
| VA | data.virginia.gov | XLSX (~86MB) | Free | Name, address, officers, status, type, creation date | Has status |

FL download: https://dos.fl.gov/sunbiz/other-services/data-downloads/

VA download: https://data.virginia.gov/dataset/corporation

Free Subscription Downloads

| State | Source | Cost | Records | Notes |
|-------|--------|------|---------|-------|
| CA | bizfileOnline.sos.ca.gov | FREE (weekly subscription) | ~17M | Sign up at BizFileOnline → BE & UCC Bulk Orders → Weekly Data Download |
| FL | sftp.floridados.gov | FREE (SFTP) | ~4M | Quarterly full + daily diffs. User: Public / Pass: PubAccess1845! |

Paid Bulk Data

| State | Source | Cost | Notes |
|-------|--------|------|-------|
| WY | SOS subscription form | $10K+/year | Too expensive; we scrape WyoBiz instead |
| TX | SOSDirect bulk orders | $20/month (weekly) or $1,350 one-time | https://direct.sos.state.tx.us/help/help-corp.asp?pg=bulk |
| TX | Comptroller franchise tax | FREE on data.texas.gov (xn8i-yb9w) | 3.2M records but SODA API returns empty; may need portal CSV export |
| MN | SOS data subscription | $30/week (free non-commercial) | CSV, delivered within 10 days |
| NE | SOS special request | $15 per 1,000 records | CSV with filters |
| AZ | Corp Commission form M027 | $75 partial / $1,000 full | Importable format |
| NC | SOS data subscription | $750 initial + $250/year | FTP weekly updates |
| LA | SOS office | $6,900-$12,500 | Too expensive |

No Bulk Access (Playwright search only)

These states require live SOS portal searches via our Playwright adapters (~3-20s per lookup, cached 24h):

DE, IL, GA, MA, MD, NH, NJ, SC, SD, TN, KY, IN, MS, MO, WV, ND, OK, RI, HI, NM, NV (search API only), MT, NE (unless paid), AL, AR, KS, LA, ME

Our state adapters handle all 52 jurisdictions via search_name() for on-demand lookups.
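The 24-hour caching of live lookups can be pictured as a small time-to-live wrapper around an adapter call. This is an in-memory sketch under stated assumptions: `LookupCache` and the `search(state, name)` callable are hypothetical stand-ins, since the real `search_name()` signature and cache storage are not shown here.

```python
import time
from typing import Callable, Dict, List, Tuple

CACHE_TTL_SECONDS = 24 * 60 * 60  # "cached 24h"


class LookupCache:
    """Cache slow portal searches per (state, name) for a fixed TTL."""

    def __init__(
        self,
        search: Callable[[str, str], List[dict]],  # hypothetical adapter call
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._search = search
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, List[dict]]] = {}

    def lookup(self, state: str, name: str) -> List[dict]:
        # Normalize the key so "de"/"DE" and stray whitespace share an entry.
        key = (state.upper(), name.strip().upper())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the 3-20s portal search
        results = self._search(key[0], key[1])
        self._entries[key] = (now, results)
        return results
```

An in-memory dictionary is lost on restart; a shared store would be needed if several workers should benefit from the same cached result.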

SEC EDGAR (public companies only)

For the ~10K publicly traded companies, SEC filings include an authoritative state of incorporation.
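One way to read that value is the EDGAR submissions JSON at `data.sec.gov`, which carries a `stateOfIncorporation` field per filer. The sketch below assumes that endpoint is the intended source; the helper names are hypothetical, and the caller must supply a `User-Agent` string with contact details, which the SEC asks of automated clients.

```python
import json
from typing import Optional
from urllib.request import Request, urlopen


def submissions_url(cik: int) -> str:
    """EDGAR submissions URL; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"


def state_of_incorporation(submission: dict) -> Optional[str]:
    """Return the filer's state code in uppercase, or None if absent."""
    value = (submission.get("stateOfIncorporation") or "").strip().upper()
    return value or None


def fetch_submission(cik: int, user_agent: str) -> dict:
    """Download one filer's submissions document."""
    req = Request(submissions_url(cik), headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

The field is returned as reported by the filer and is not validated here, so codes for non-US jurisdictions may appear and would need separate handling before being written to `formation_state`.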

Aggregator APIs

| Service | Free Tier | Coverage | Notes |
|---------|-----------|----------|-------|
| OpenCorporates | 200 calls/month | 170+ jurisdictions | Not viable for bulk. Paid plans start at GBP 2,250/yr |
| Cobalt Intelligence | 20 free lookups | All 50 states | Credit-based paid API. Gold standard but expensive |
| Apify "US Business Entity Search" | Pay-per-use | 34 state registries | Uses SIP Public Data Gateway. Most comprehensive |

Daily Cron

The pw-entity-cache-refresh timer runs daily at 07:00 UTC (2 a.m. CDT, 1 a.m. CST):

python -m scripts.formation.bulk_download --all

Downloads all configured Socrata states and upserts into entity_cache.
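The upsert step can be illustrated with an `INSERT ... ON CONFLICT ... DO UPDATE` statement. This sketch uses SQLite from the Python standard library so it runs on its own; the production table is presumably PostgreSQL, where the same clause exists. The conflict key `(state, entity_number)` and the reduced column list are assumptions made for the example, not the contents of the real migration.

```python
import sqlite3
from typing import Iterable, Mapping

# Reduced, illustrative version of entity_cache (assumed conflict key).
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS entity_cache (
    entity_name   TEXT NOT NULL,
    entity_number TEXT,
    status        TEXT,
    state         TEXT NOT NULL,
    source        TEXT DEFAULT 'socrata',
    UNIQUE (state, entity_number)
)
"""

UPSERT_SQL = """
INSERT INTO entity_cache (entity_name, entity_number, status, state, source)
VALUES (:entity_name, :entity_number, :status, :state, :source)
ON CONFLICT (state, entity_number) DO UPDATE SET
    entity_name = excluded.entity_name,
    status      = excluded.status,
    source      = excluded.source
"""


def upsert_rows(conn: sqlite3.Connection, rows: Iterable[Mapping[str, str]]) -> None:
    """Insert new entities and refresh existing ones in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, list(rows))


conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)
row = {"entity_name": "ACME LLC", "entity_number": "123",
       "status": "ACTIVE", "state": "CO", "source": "socrata"}
upsert_rows(conn, [row])
upsert_rows(conn, [{**row, "status": "DISSOLVED"}])  # same key: updates in place
```

Re-running the daily download therefore refreshes status changes without creating duplicate rows for the same filing number.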

Schema

-- entity_cache table (migration 009)
entity_name         TEXT NOT NULL        -- Uppercase
entity_number       TEXT                 -- State filing number
entity_type         TEXT                 -- LLC, CORPORATION, LP, NONPROFIT
status              TEXT                 -- ACTIVE, DISSOLVED, SUSPENDED, DELINQUENT, INACTIVE
formation_date      DATE
formation_state     TEXT                 -- 2-letter code of state where entity was originally formed
registered_agent    TEXT
principal_address   TEXT
state               TEXT NOT NULL        -- State this record is registered in
source              TEXT DEFAULT 'socrata'

UNIQUE(jurisdiction, entity_number)    -- no jurisdiction column is listed above; presumably the state column
INDEX gin_trgm on entity_name          -- Fuzzy search
INDEX on state
INDEX on status
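
Each Socrata dataset names its status and formation-state fields differently, so loading into this schema needs a per-state mapping. The sketch below takes the field names from the Socrata table earlier in this document; the function name is hypothetical, and the only normalization shown is trimming and uppercasing. Mapping source values onto the status list or onto 2-letter codes is left out because the raw values are not documented here.

```python
from typing import Dict, Optional, Tuple

# state -> (status field, formation-state field); None means no such field.
FIELD_MAP: Dict[str, Tuple[Optional[str], str]] = {
    "CO": ("entitystatus", "jurisdictonofformation"),
    "IA": ("entity_status", "home_state"),
    "CT": ("status", "state_of_formation"),
    "OR": ("status", "state_of_origin"),
    "NY": (None, "jurisdiction"),  # dataset lists active entities only
}


def _clean(value: Optional[str]) -> Optional[str]:
    """Trim and uppercase a raw value; empty or missing becomes None."""
    text = (value or "").strip().upper()
    return text or None


def status_and_formation(state: str, record: dict) -> Tuple[Optional[str], Optional[str]]:
    """Extract (status, formation_state) from one raw Socrata record."""
    status_field, formation_field = FIELD_MAP[state]
    if status_field is None:
        status = "ACTIVE"  # no status field: every record is active
    else:
        status = _clean(record.get(status_field))
    return status, _clean(record.get(formation_field))
```

Keeping the mapping in one dictionary means a new Socrata state only needs a new entry rather than a new code path.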