# Entity Cache Data Sources

Bulk business entity data for the corporation status check feature.

Updated: 2026-04-20

## Working Socrata SODA API States (free, JSON, unlimited)

| State | Dataset ID | Records | Status Field | Formation State Field | Notes |
|-------|-----------|---------|--------------|----------------------|-------|
| CO | `4ykn-tg5h` | ~3M | `entitystatus` | `jurisdictonofformation` | Fully loaded |
| IA | `ykb6-ywnd` | ~500K | `entity_status` | `home_state` | Working |
| CT | `n7gp-d28j` | ~1.2M | `status` | `state_of_formation` | Working |
| OR | `tckn-sxa6` | ~800K | `status` | `state_of_origin` | Active businesses only |
| NY | `n9v6-gdp6` | ~2M | N/A (active only) | `jurisdiction` | No status field; all records are active |

**API pattern:** `https://data.{state}.gov/resource/{id}.json?$limit=50000&$offset=0&$order=:id`
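
The pattern above implies offset-based paging: request 50,000 rows at a time, ordered by `:id`, until a short page comes back. A minimal sketch of that loop follows. The helper names (`page_url`, `iter_rows`) are hypothetical and not taken from the repo, and it assumes `host` is the portal's full hostname (for example `data.colorado.gov`), since `{state}` is not necessarily the two-letter code.

```python
import json
from typing import Callable, Iterator, List
from urllib.parse import urlencode
from urllib.request import urlopen

PAGE_SIZE = 50000  # same $limit as the documented pattern


def page_url(host: str, dataset_id: str, offset: int, limit: int = PAGE_SIZE) -> str:
    """Build the URL for one page; `host` is the full portal hostname (assumption)."""
    query = urlencode(
        {"$limit": limit, "$offset": offset, "$order": ":id"},
        safe="$:",  # keep the SODA parameter names readable
    )
    return f"https://{host}/resource/{dataset_id}.json?{query}"


def fetch_json(url: str) -> List[dict]:
    """Download one page and decode it as a list of row dicts."""
    with urlopen(url, timeout=60) as resp:
        return json.load(resp)


def iter_rows(host: str, dataset_id: str,
              fetch: Callable[[str], List[dict]] = fetch_json,
              limit: int = PAGE_SIZE) -> Iterator[dict]:
    """Yield every row, advancing $offset until a page is shorter than $limit."""
    offset = 0
    while True:
        rows = fetch(page_url(host, dataset_id, offset, limit))
        yield from rows
        if len(rows) < limit:
            return
        offset += limit
```

Passing `fetch` as a parameter keeps the paging logic testable without network access; a stable `$order` matters because offset paging can skip or repeat rows if the ordering changes between requests.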

## Broken Socrata URLs (portals reorganized, need new IDs)

| State | Old ID | Notes |
|-------|--------|-------|
| WA | `7naq-cqm3` | 404. data.wa.gov catalog empty for business category |
| IL | `vqps-xatp` | 404. IL SOS prohibits bulk scraping officially |
| PA | `6ftj-q3fu` | 404. PA has `xvd7-5r2c` but no status field |
| MI | `uc6u-xab8` | 404. LARA portal, no confirmed free download |
| AK | `p2kg-xwxr` | DNS failure. data.alaska.gov may be deprecated |
| VT | `c7cm-s92n` | 404. VT open data portal reorganized |
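
The table distinguishes two failure modes: an HTTP 404 from a live portal and a DNS failure where the host itself is gone. A small re-check helper could tell them apart. This is only a sketch: `probe` is a hypothetical name, not something in the repo, and it assumes the standard-library `urllib` is sufficient for the check.

```python
import urllib.error
import urllib.request
from typing import Callable


def probe(url: str, opener: Callable = urllib.request.urlopen) -> str:
    """Classify one endpoint as ok, an HTTP error, or unreachable."""
    try:
        with opener(url, timeout=15) as resp:
            return f"ok {resp.status}"
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError, so it has to be caught first.
        return f"http {exc.code}"
    except urllib.error.URLError as exc:
        # DNS failures and refused connections end up here.
        return f"unreachable: {exc.reason}"
```

For example, `probe("https://data.wa.gov/resource/7naq-cqm3.json?$limit=1")` would re-test the WA entry with a one-row request.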

## Free Bulk Download (non-Socrata)

| State | Source | Format | Cost | Fields | Status |
|-------|--------|--------|------|--------|--------|
| FL | Sunbiz FTP | Fixed-width ASCII | Free (register for FTP creds) | Name, status (A/I), filing type, date, EIN, address, RA, officers | Has status |
| VA | data.virginia.gov | XLSX (~86MB) | Free | Name, address, officers, status, type, creation date | Has status |

**FL download:** https://dos.fl.gov/sunbiz/other-services/data-downloads/

**VA download:** https://data.virginia.gov/dataset/corporation

## Free Subscription Downloads

| State | Source | Cost | Records | Notes |
|-------|--------|------|---------|-------|
| CA | bizfileOnline.sos.ca.gov | **FREE** (weekly subscription) | ~17M | Sign up at BizFileOnline → BE & UCC Bulk Orders → Weekly Data Download |
| FL | sftp.floridados.gov | **FREE** (SFTP) | ~4M | User: `Public` / Pass: `PubAccess1845!`; quarterly full + daily diffs |

## Paid Bulk Data

| State | Source | Cost | Notes |
|-------|--------|------|-------|
| WY | SOS subscription form | $10K+/year | Too expensive; we scrape WyoBiz instead |
| TX | SOSDirect bulk orders | $20/month (weekly) or $1,350 one-time | https://direct.sos.state.tx.us/help/help-corp.asp?pg=bulk |
| TX | Comptroller franchise tax | **FREE** on data.texas.gov (`xn8i-yb9w`) | 3.2M records but SODA API returns empty; may need portal CSV export |
| MN | SOS data subscription | $30/week (free non-commercial) | CSV, delivered within 10 days |
| NE | SOS special request | $15 per 1,000 records | CSV with filters |
| AZ | Corp Commission form M027 | $75 partial / $1,000 full | Importable format |
| NC | SOS data subscription | $750 initial + $250/year | FTP weekly updates |
| LA | SOS office | $6,900–$12,500 | Too expensive |

## No Bulk Access (Playwright search only)

These states require live SOS portal searches via our Playwright adapters (~3-20s per lookup, cached 24h):

DE, IL, GA, MA, MD, NH, NJ, SC, SD, TN, KY, IN, MS, MO, WV, ND, OK, RI, HI, NM, NV (search API only), MT, NE (unless paid), AL, AR, KS, LA, ME

Our state adapters handle all 52 jurisdictions via `search_name()` for on-demand lookups.
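
Because each live lookup takes several seconds, the 24-hour cache is what keeps repeat queries cheap. The idea can be sketched as a small wrapper around any lookup callable. This is an illustration only: `LookupCache` and the `(state, name)` call shape are assumptions, and the real `search_name()` signature and cache backend may differ.

```python
import time
from typing import Any, Callable, Dict, Tuple

CACHE_TTL_SECONDS = 24 * 60 * 60  # "cached 24h"


class LookupCache:
    """In-memory TTL cache in front of a slow lookup function (hypothetical)."""

    def __init__(self, lookup: Callable[[str, str], Any],
                 ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get(self, state: str, name: str) -> Any:
        # Normalize the key so "Acme LLC" and " acme llc " share one entry.
        key = (state.upper(), name.strip().upper())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh, skip the portal round trip
        result = self._lookup(state, name)
        self._entries[key] = (now, result)
        return result
```

Injecting `clock` makes expiry testable without waiting; a shared store would be needed if several worker processes should see the same cache.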

## SEC EDGAR (public companies only)

For the ~10K publicly traded companies, SEC filings include an authoritative state of incorporation:

- **Company list:** https://www.sec.gov/files/company_tickers.json
- **Detail:** https://data.sec.gov/submissions/CIK{padded_10}.json (`{padded_10}` is the CIK zero-padded to 10 digits)
- **Fields:** `stateOfIncorporation`, `name`, `ein`, `addresses`
- **Rate limit:** 10 req/sec, free, requires a `User-Agent` header
- **Limitation:** Only SEC-registered filers (public companies, not private LLCs)
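
The bullets above translate into three concerns: zero-padding the CIK, sending a `User-Agent`, and staying under 10 requests per second. A hedged sketch follows. `Throttle`, `submissions_url`, and `state_of_incorporation` are hypothetical names, and the `USER_AGENT` value is a placeholder that must be replaced with real contact details.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

SUBMISSIONS_URL = "https://data.sec.gov/submissions/CIK{cik:010d}.json"
USER_AGENT = "Example Org contact@example.com"  # placeholder (assumption)


def submissions_url(cik: int) -> str:
    """Zero-pad the CIK to 10 digits, as the Detail URL requires."""
    return SUBMISSIONS_URL.format(cik=cik)


class Throttle:
    """Space calls at least 1/per_second apart (simple client-side limiter)."""

    def __init__(self, per_second: float = 10,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._interval = 1.0 / per_second
        self._clock = clock
        self._sleep = sleep
        self._next_ok = 0.0

    def wait(self) -> None:
        now = self._clock()
        if now < self._next_ok:
            self._sleep(self._next_ok - now)
            now = self._next_ok
        self._next_ok = now + self._interval


def state_of_incorporation(cik: int, throttle: Throttle) -> Optional[str]:
    """Fetch one submissions document and return its stateOfIncorporation."""
    throttle.wait()
    request = urllib.request.Request(
        submissions_url(cik), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        data = json.load(resp)
    return data.get("stateOfIncorporation") or None
```

Sharing one `Throttle` instance across all calls is what enforces the limit; creating a new one per request would defeat it.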

## Aggregator APIs

| Service | Free Tier | Coverage | Notes |
|---------|-----------|----------|-------|
| OpenCorporates | 200 calls/month | 170+ jurisdictions | Not viable for bulk. Paid plans start GBP 2,250/yr |
| Cobalt Intelligence | 20 free lookups | All 50 states | Credit-based paid API. Gold standard but expensive |
| Apify "US Business Entity Search" | Pay-per-use | 34 state registries | Uses SIP Public Data Gateway. Most comprehensive |

## Daily Cron

The `pw-entity-cache-refresh` timer runs daily at 07:00 UTC (2am CDT, 1am CST):

```
python -m scripts.formation.bulk_download --all
```

It downloads every configured Socrata state and upserts the rows into `entity_cache`.
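
Before rows can be upserted, each state's column names have to be mapped onto the shared schema. The sketch below does that for the status and formation-state columns listed in the Socrata table earlier in this document. `FieldMap` and `normalize` are hypothetical names, the name and filing-number columns are left out because this document does not list them, and mapping raw formation values to 2-letter codes is out of scope here.

```python
from typing import Dict, NamedTuple, Optional


class FieldMap(NamedTuple):
    status: Optional[str]  # None when the dataset has no status column
    formation_state: str


# Column names copied from the Socrata table in this document.
FIELD_MAPS: Dict[str, FieldMap] = {
    "CO": FieldMap("entitystatus", "jurisdictonofformation"),
    "IA": FieldMap("entity_status", "home_state"),
    "CT": FieldMap("status", "state_of_formation"),
    "OR": FieldMap("status", "state_of_origin"),
    "NY": FieldMap(None, "jurisdiction"),
}


def normalize(state: str, row: dict) -> dict:
    """Pick the status and formation-state values out of one raw Socrata row."""
    fields = FIELD_MAPS[state]
    if fields.status is None:
        status = "ACTIVE"  # NY publishes active entities only
    else:
        status = (row.get(fields.status) or "").strip().upper() or None
    formation = (row.get(fields.formation_state) or "").strip().upper() or None
    return {
        "state": state,
        "status": status,
        "formation_state_raw": formation,  # not yet reduced to a 2-letter code
        "source": "socrata",
    }
```

Keeping the per-state differences in one dictionary means adding a state is a data change rather than a code change.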

## Schema

```sql
-- entity_cache table (migration 009)
CREATE TABLE entity_cache (
    entity_name       TEXT NOT NULL,           -- Uppercase
    entity_number     TEXT,                    -- State filing number
    entity_type       TEXT,                    -- LLC, CORPORATION, LP, NONPROFIT
    status            TEXT,                    -- ACTIVE, DISSOLVED, SUSPENDED, DELINQUENT, INACTIVE
    formation_date    DATE,
    formation_state   TEXT,                    -- 2-letter code of state where entity was originally formed
    registered_agent  TEXT,
    principal_address TEXT,
    state             TEXT NOT NULL,           -- State this record is registered in
    source            TEXT DEFAULT 'socrata',
    UNIQUE (state, entity_number)
);

CREATE INDEX ON entity_cache USING gin (entity_name gin_trgm_ops);  -- Fuzzy search (needs pg_trgm)
CREATE INDEX ON entity_cache (state);
CREATE INDEX ON entity_cache (status);
```
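
To illustrate what "upsert" means against the uniqueness constraint, here is a sketch that uses an in-memory SQLite database as a stand-in. Assumptions: the conflict key is the registering state plus `entity_number`, only a handful of columns are included, and the helper names (`make_demo_db`, `upsert`) are hypothetical. The production table appears to be PostgreSQL (the trigram index suggests so), where the same `ON CONFLICT` clause would be issued through a PostgreSQL driver.

```python
import sqlite3
from typing import Iterable

UPSERT = """
INSERT INTO entity_cache (entity_name, entity_number, status, state, source)
VALUES (:entity_name, :entity_number, :status, :state, :source)
ON CONFLICT (state, entity_number) DO UPDATE SET
    entity_name = excluded.entity_name,
    status = excluded.status,
    source = excluded.source
"""


def make_demo_db() -> sqlite3.Connection:
    """Create a cut-down entity_cache table in memory (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE entity_cache (
            entity_name TEXT NOT NULL,
            entity_number TEXT,
            status TEXT,
            state TEXT NOT NULL,
            source TEXT DEFAULT 'socrata',
            UNIQUE (state, entity_number)
        )""")
    return conn


def upsert(conn: sqlite3.Connection, rows: Iterable[dict]) -> None:
    """Insert new rows and overwrite existing ones that share the key."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)
```

One caveat worth knowing: rows with a NULL `entity_number` never conflict with each other under a plain `UNIQUE` constraint, so they would accumulate as duplicates on every refresh unless handled separately.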