new-site/docs/entity-cache-sources.md
justin f8cd37ac8c Initial commit — Performance West telecom compliance platform
Includes: API (Express/TypeScript), Astro site, Python workers,
document generators, FCC compliance tools, Canada CRTC formation,
Ansible infrastructure, and deployment scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 06:54:22 -05:00

# Entity Cache Data Sources
Bulk business entity data for the corporation status check feature.
Updated: 2026-04-20
## Working Socrata SODA API States (free, JSON, unlimited)
| State | Dataset ID | Records | Status Field | Formation State Field | Notes |
|-------|-----------|---------|--------------|----------------------|-------|
| CO | `4ykn-tg5h` | ~3M | `entitystatus` | `jurisdictonofformation` | Fully loaded |
| IA | `ykb6-ywnd` | ~500K | `entity_status` | `home_state` | Working |
| CT | `n7gp-d28j` | ~1.2M | `status` | `state_of_formation` | Working |
| OR | `tckn-sxa6` | ~800K | `status` | `state_of_origin` | Active businesses only |
| NY | `n9v6-gdp6` | ~2M | N/A (active only) | `jurisdiction` | No status field — all records are active |
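**API pattern:** `https://data.{state}.gov/resource/{id}.json?$limit=50000&$offset=0&$order=:id`, where `{state}` is the portal's hostname segment (for example `data.colorado.gov` or `data.ny.gov`), which is not always the two-letter code. Increase `$offset` by `$limit` until a page returns fewer than `$limit` rows.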
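
The paging pattern above can be sketched as follows. This is a minimal illustration, not the project's actual downloader: `page_url`, `fetch_all`, and the injected `fetch_json` callable are hypothetical names, and the stopping rule (stop on the first short page) is an assumption about how the loop terminates.

```python
from urllib.parse import urlencode

PAGE_SIZE = 50_000  # matches $limit in the API pattern


def page_url(domain: str, dataset_id: str, offset: int, limit: int = PAGE_SIZE) -> str:
    """Build the URL for one page of a Socrata SODA resource."""
    # Keep "$" and ":" unescaped so the query reads like the documented pattern.
    query = urlencode(
        {"$limit": limit, "$offset": offset, "$order": ":id"}, safe="$:"
    )
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


def fetch_all(domain: str, dataset_id: str, fetch_json, limit: int = PAGE_SIZE):
    """Yield every row, advancing $offset until a short page comes back.

    fetch_json is any callable that takes a URL and returns a list of dicts
    (hypothetical; plug in whatever HTTP client the worker uses).
    """
    offset = 0
    while True:
        rows = fetch_json(page_url(domain, dataset_id, offset, limit))
        yield from rows
        if len(rows) < limit:  # short (or empty) page means we reached the end
            break
        offset += limit
```

Ordering by `:id` keeps the row order stable between requests, so offset-based paging does not skip or repeat rows while a download is in progress.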
## Broken Socrata URLs (portals reorganized, need new IDs)
| State | Old ID | Notes |
|-------|--------|-------|
| WA | `7naq-cqm3` | 404. data.wa.gov catalog empty for business category |
| IL | `vqps-xatp` | 404. IL SOS prohibits bulk scraping officially |
| PA | `6ftj-q3fu` | 404. PA has `xvd7-5r2c` but no status field |
| MI | `uc6u-xab8` | 404. LARA portal, no confirmed free download |
| AK | `p2kg-xwxr` | DNS failure. data.alaska.gov may be deprecated |
| VT | `c7cm-s92n` | 404. VT open data portal reorganized |
## Free Bulk Download (non-Socrata)
| State | Source | Format | Cost | Fields | Status |
|-------|--------|--------|------|--------|--------|
| FL | Sunbiz FTP | Fixed-width ASCII | Free (register for FTP creds) | Name, status (A/I), filing type, date, EIN, address, RA, officers | Has status |
| VA | data.virginia.gov | XLSX (~86MB) | Free | Name, address, officers, status, type, creation date | Has status |
**FL download:** https://dos.fl.gov/sunbiz/other-services/data-downloads/
**VA download:** https://data.virginia.gov/dataset/corporation
## Free Subscription Downloads
| State | Source | Cost | Records | Notes |
|-------|--------|------|---------|-------|
| CA | bizfileOnline.sos.ca.gov | **FREE** (weekly subscription) | ~17M | Sign up at BizFileOnline → BE & UCC Bulk Orders → Weekly Data Download |
| FL | sftp.floridados.gov | **FREE** (SFTP) | ~4M | User: Public / Pass: PubAccess1845! — Quarterly full + daily diffs |
## Paid Bulk Data
| State | Source | Cost | Notes |
|-------|--------|------|-------|
| WY | SOS subscription form | $10K+/year | Too expensive — we scrape WyoBiz instead |
| TX | SOSDirect bulk orders | $20/month (weekly) or $1,350 one-time | https://direct.sos.state.tx.us/help/help-corp.asp?pg=bulk |
| TX | Comptroller franchise tax | **FREE** on data.texas.gov (xn8i-yb9w) | 3.2M records but SODA API returns empty — may need portal CSV export |
| MN | SOS data subscription | $30/week (free non-commercial) | CSV, delivered within 10 days |
| NE | SOS special request | $15 per 1,000 records | CSV with filters |
| AZ | Corp Commission form M027 | $75 partial / $1,000 full | Importable format |
| NC | SOS data subscription | $750 initial + $250/year | FTP weekly updates |
| LA | SOS office | $6,900-$12,500 | Too expensive |
## No Bulk Access (Playwright search only)
These states require live SOS portal searches via our Playwright adapters (~3-20s per lookup, cached 24h):
DE, IL, GA, MA, MD, NH, NJ, SC, SD, TN, KY, IN, MS, MO, WV, ND, OK, RI, HI, NM, NV (search API only), MT, NE (unless paid), AL, AR, KS, LA, ME
Our state adapters handle all 52 jurisdictions via `search_name()` for on-demand lookups.
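
The 24-hour caching of live lookups could look roughly like the sketch below. It is an in-memory illustration only: `LookupCache` and `get_or_fetch` are hypothetical names, the `(state, name)` key and the name normalization are assumptions, and the real adapters may well cache in the database instead.

```python
import time
from typing import Any, Callable, Dict, Tuple

TTL_SECONDS = 24 * 60 * 60  # "cached 24h"


class LookupCache:
    """In-memory TTL cache keyed by (state, normalized name). Hypothetical helper."""

    def __init__(self, ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get_or_fetch(self, state: str, name: str,
                     fetch: Callable[[str, str], Any]) -> Any:
        # Assumed normalization: uppercase and collapse runs of whitespace.
        key = (state.upper(), " ".join(name.upper().split()))
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the slow portal search
        result = fetch(state, name)  # e.g. a wrapper around an adapter's search_name()
        self._entries[key] = (now, result)
        return result
```

With lookups taking several seconds each, normalizing the key before the cache check avoids repeating a portal search for inputs that differ only in case or spacing.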
## SEC EDGAR (public companies only)
For the ~10K publicly traded companies, SEC filings include the authoritative state of incorporation:
- **Company list:** https://www.sec.gov/files/company_tickers.json
- **Detail:** https://data.sec.gov/submissions/CIK{padded_10}.json
- **Fields:** `stateOfIncorporation`, `name`, `ein`, `addresses`
- **Rate limit:** 10 req/sec, free, requires User-Agent header
- **Limitation:** Only SEC-registered filers (public companies, not private LLCs)
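
A minimal sketch of the detail lookup described above, using only the standard library. The function names and the `USER_AGENT` value are placeholders introduced here; callers are assumed to throttle themselves to stay under the 10 req/sec limit.

```python
import json
import urllib.request

# Placeholder value: SEC asks for a User-Agent that identifies the requester.
USER_AGENT = "ExampleCo admin@example.com"


def submissions_url(cik: int) -> str:
    """Zero-pad the CIK to 10 digits, as the submissions endpoint expects."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def state_of_incorporation(cik: int, user_agent: str = USER_AGENT):
    """Fetch one filer's submissions document and return stateOfIncorporation.

    Returns None when the field is absent. Performs a network request.
    """
    req = urllib.request.Request(
        submissions_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("stateOfIncorporation")
```

For example, `submissions_url(320193)` yields `https://data.sec.gov/submissions/CIK0000320193.json`.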
## Aggregator APIs
| Service | Free Tier | Coverage | Notes |
|---------|-----------|----------|-------|
| OpenCorporates | 200 calls/month | 170+ jurisdictions | Not viable for bulk. Paid plans start GBP 2,250/yr |
| Cobalt Intelligence | 20 free lookups | All 50 states | Credit-based paid API. Gold standard but expensive |
| Apify "US Business Entity Search" | Pay-per-use | 34 state registries | Uses SIP Public Data Gateway. Most comprehensive |
## Daily Cron
The `pw-entity-cache-refresh` timer runs daily at 07:00 UTC (02:00 CDT, 01:00 CST):
```
python -m scripts.formation.bulk_download --all
```
Downloads all configured Socrata states and upserts into `entity_cache`.
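
The upsert step might be sketched like this. It is not the contents of `scripts.formation.bulk_download`: the example uses SQLite from the standard library as a stand-in so it runs anywhere, it assumes the conflict key is the registering state plus `entity_number`, and `upsert_rows` and `demo_connection` are hypothetical names. PostgreSQL accepts the same `ON CONFLICT ... DO UPDATE` shape with `EXCLUDED`.

```python
import sqlite3

# Assumption: rows are unique per (state, entity_number).
UPSERT_SQL = """
INSERT INTO entity_cache (entity_name, entity_number, status, state, source)
VALUES (:entity_name, :entity_number, :status, :state, :source)
ON CONFLICT (state, entity_number) DO UPDATE SET
    entity_name = excluded.entity_name,
    status      = excluded.status,
    source      = excluded.source
"""


def upsert_rows(conn: sqlite3.Connection, rows) -> None:
    """Insert new entities and refresh existing ones in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT_SQL, rows)


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory table with a subset of the entity_cache columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE entity_cache (
               entity_name   TEXT NOT NULL,
               entity_number TEXT,
               status        TEXT,
               state         TEXT NOT NULL,
               source        TEXT DEFAULT 'socrata',
               UNIQUE (state, entity_number))"""
    )
    return conn
```

Because the daily run re-downloads full datasets, an upsert keeps the job idempotent: re-running it updates statuses in place instead of duplicating rows.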
## Schema
```sql
-- entity_cache table (migration 009)
entity_name TEXT NOT NULL -- Uppercase
entity_number TEXT -- State filing number
entity_type TEXT -- LLC, CORPORATION, LP, NONPROFIT
status TEXT -- ACTIVE, DISSOLVED, SUSPENDED, DELINQUENT, INACTIVE
formation_date DATE
formation_state TEXT -- 2-letter code of state where entity was originally formed
registered_agent TEXT
principal_address TEXT
state TEXT NOT NULL -- State this record is registered in
source TEXT DEFAULT 'socrata'
UNIQUE(state, entity_number)
INDEX gin_trgm on entity_name -- Fuzzy search
INDEX on state
INDEX on status
```
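
To give a feel for what the trigram index on `entity_name` makes possible, here is a rough Python approximation of trigram similarity. It is an illustration of the idea only and will not reproduce the database's scores exactly; `trigrams` and `similarity` are names introduced for this sketch.

```python
import re


def trigrams(text: str) -> set:
    """Split text into lowercase words and collect padded 3-character windows."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Near-miss names such as "ACME HOLDINGS LLC" and "ACME HOLDING LLC" share most of their trigrams and score high, while unrelated names score near zero, which is why this kind of index suits tolerant name matching.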