Initial commit — Performance West telecom compliance platform

Includes: API (Express/TypeScript), Astro site, Python workers,
document generators, FCC compliance tools, Canada CRTC formation,
Ansible infrastructure, and deployment scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
justin 2026-04-27 06:54:22 -05:00
commit f8cd37ac8c
1823 changed files with 145167 additions and 0 deletions

@ -0,0 +1,111 @@
# Entity Cache Data Sources
Bulk business entity data for the corporation status check feature.
Updated: 2026-04-20
## Working Socrata SODA API States (free, JSON, unlimited)
| State | Dataset ID | Records | Status Field | Formation State Field | Notes |
|-------|-----------|---------|--------------|----------------------|-------|
| CO | `4ykn-tg5h` | ~3M | `entitystatus` | `jurisdictonofformation` | Fully loaded |
| IA | `ykb6-ywnd` | ~500K | `entity_status` | `home_state` | Working |
| CT | `n7gp-d28j` | ~1.2M | `status` | `state_of_formation` | Working |
| OR | `tckn-sxa6` | ~800K | `status` | `state_of_origin` | Active businesses only |
| NY | `n9v6-gdp6` | ~2M | N/A (active only) | `jurisdiction` | No status field — all records are active |
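**API pattern:** `https://data.{state}.gov/resource/{id}.json?$limit=50000&$offset=0&$order=:id` where `{state}` is the portal's hostname segment, which is not always the two-letter code (e.g. `data.colorado.gov`, `data.ny.gov`). Increase `$offset` by `$limit` until a page returns fewer than `$limit` rows.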
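The paging pattern above can be sketched in Python as follows. This is a minimal illustration, not the project's downloader: `page_url`, `fetch_json`, and `iter_records` are hypothetical names, and the portal hostname is passed in explicitly because it varies by state.

```python
import json
import urllib.request
from typing import Callable, Iterator, List
from urllib.parse import urlencode

PAGE_SIZE = 50_000  # matches the $limit in the API pattern


def page_url(host: str, dataset_id: str, offset: int, limit: int = PAGE_SIZE) -> str:
    """Build the URL for one page; ordering by :id keeps paging stable."""
    query = urlencode(
        {"$limit": limit, "$offset": offset, "$order": ":id"},
        safe="$:",  # keep the SODA "$" and ":" characters readable
    )
    return f"https://{host}/resource/{dataset_id}.json?{query}"


def fetch_json(url: str) -> List[dict]:
    """Default fetcher (hypothetical helper): GET the URL and decode the JSON array."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def iter_records(
    host: str,
    dataset_id: str,
    fetch: Callable[[str], List[dict]] = fetch_json,
    limit: int = PAGE_SIZE,
) -> Iterator[dict]:
    """Yield every row, advancing $offset until a short page signals the end."""
    offset = 0
    while True:
        rows = fetch(page_url(host, dataset_id, offset, limit))
        yield from rows
        if len(rows) < limit:
            return
        offset += limit
```

For example, `iter_records("data.colorado.gov", "4ykn-tg5h")` would walk the Colorado dataset page by page. The `fetch` parameter is injectable so the loop can be exercised without network access.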
## Broken Socrata URLs (portals reorganized, need new IDs)
| State | Old ID | Notes |
|-------|--------|-------|
| WA | `7naq-cqm3` | 404. data.wa.gov catalog empty for business category |
| IL | `vqps-xatp` | 404. IL SOS prohibits bulk scraping officially |
| PA | `6ftj-q3fu` | 404. PA has `xvd7-5r2c` but no status field |
| MI | `uc6u-xab8` | 404. LARA portal, no confirmed free download |
| AK | `p2kg-xwxr` | DNS failure. data.alaska.gov may be deprecated |
| VT | `c7cm-s92n` | 404. VT open data portal reorganized |
## Free Bulk Download (non-Socrata)
| State | Source | Format | Cost | Fields | Status |
|-------|--------|--------|------|--------|--------|
| FL | Sunbiz FTP | Fixed-width ASCII | Free (register for FTP creds) | Name, status (A/I), filing type, date, EIN, address, RA, officers | Has status |
| VA | data.virginia.gov | XLSX (~86MB) | Free | Name, address, officers, status, type, creation date | Has status |
**FL download:** https://dos.fl.gov/sunbiz/other-services/data-downloads/
**VA download:** https://data.virginia.gov/dataset/corporation
## Free Subscription Downloads
| State | Source | Cost | Records | Notes |
|-------|--------|------|---------|-------|
| CA | bizfileOnline.sos.ca.gov | **FREE** (weekly subscription) | ~17M | Sign up at BizFileOnline → BE & UCC Bulk Orders → Weekly Data Download |
| FL | sftp.floridados.gov | **FREE** (SFTP) | ~4M | User: Public / Pass: PubAccess1845! — Quarterly full + daily diffs |
## Paid Bulk Data
| State | Source | Cost | Notes |
|-------|--------|------|-------|
| WY | SOS subscription form | $10K+/year | Too expensive — we scrape WyoBiz instead |
| TX | SOSDirect bulk orders | $20/month (weekly) or $1,350 one-time | https://direct.sos.state.tx.us/help/help-corp.asp?pg=bulk |
| TX | Comptroller franchise tax | **FREE** on data.texas.gov (xn8i-yb9w) | 3.2M records but SODA API returns empty — may need portal CSV export |
| MN | SOS data subscription | $30/week (free non-commercial) | CSV, delivered within 10 days |
| NE | SOS special request | $15 per 1,000 records | CSV with filters |
| AZ | Corp Commission form M027 | $75 partial / $1,000 full | Importable format |
| NC | SOS data subscription | $750 initial + $250/year | FTP weekly updates |
| LA | SOS office | $6,900-$12,500 | Too expensive |
## No Bulk Access (Playwright search only)
These states require live SOS portal searches via our Playwright adapters (~3-20s per lookup, cached 24h):
DE, IL, GA, MA, MD, NH, NJ, SC, SD, TN, KY, IN, MS, MO, WV, ND, OK, RI, HI, NM, NV (search API only), MT, NE (unless paid), AL, AR, KS, LA, ME
Our state adapters handle all 52 jurisdictions via `search_name()` for on-demand lookups.
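The 24-hour caching of live lookups could look roughly like the sketch below. It assumes a search callable that takes a state code and a name; the real `search_name()` signature is not shown in this document, so `LookupCache` and its parameters are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple

CACHE_TTL_SECONDS = 24 * 60 * 60  # "cached 24h"


class LookupCache:
    """In-memory TTL cache in front of a slow portal search (illustrative only)."""

    def __init__(
        self,
        search: Callable[[str, str], Any],  # stand-in for an adapter's search_name()
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._search = search
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def lookup(self, state: str, name: str) -> Any:
        # Normalize the key so "de"/" Acme llc " and "DE"/"ACME LLC" share an entry.
        key = (state.upper(), name.strip().upper())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the 3-20s portal round trip
        result = self._search(key[0], key[1])
        self._entries[key] = (now, result)
        return result
```

A production cache would more likely live in a shared store than in process memory; the point here is only the expiry check.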
## SEC EDGAR (public companies only)
For ~10K publicly-traded companies, SEC filings include authoritative state of incorporation:
- **Company list:** https://www.sec.gov/files/company_tickers.json
- **Detail:** https://data.sec.gov/submissions/CIK{padded_10}.json
- **Fields:** `stateOfIncorporation`, `name`, `ein`, `addresses`
- **Rate limit:** 10 req/sec, free, requires User-Agent header
- **Limitation:** Only SEC-registered filers (public companies, not private LLCs)
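A minimal sketch of the detail lookup, using only the endpoint, fields, and limits listed above. `EdgarClient`, `submissions_url`, and `extract_fields` are hypothetical names, and the User-Agent string is whatever contact string the caller supplies.

```python
import json
import time
import urllib.request
from typing import Any, Dict

MIN_INTERVAL = 0.1  # seconds between requests, i.e. at most 10 req/sec
FIELDS = ("name", "ein", "stateOfIncorporation", "addresses")


def submissions_url(cik: int) -> str:
    """Zero-pad the CIK to 10 digits, as the {padded_10} placeholder requires."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def extract_fields(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields of interest; missing keys come back as None."""
    return {key: payload.get(key) for key in FIELDS}


class EdgarClient:
    def __init__(self, user_agent: str, min_interval: float = MIN_INTERVAL) -> None:
        self.user_agent = user_agent
        self.min_interval = min_interval
        self._last_request = float("-inf")  # so the first call never waits

    def _throttle(self) -> None:
        """Sleep just long enough to stay under the request rate."""
        wait = self._last_request + self.min_interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()

    def company(self, cik: int) -> Dict[str, Any]:
        self._throttle()
        request = urllib.request.Request(
            submissions_url(cik), headers={"User-Agent": self.user_agent}
        )
        with urllib.request.urlopen(request, timeout=30) as resp:
            return extract_fields(json.load(resp))
```

The throttle is per client instance, so several workers sharing one IP would need a shared limiter to stay under the ceiling.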
## Aggregator APIs
| Service | Free Tier | Coverage | Notes |
|---------|-----------|----------|-------|
| OpenCorporates | 200 calls/month | 170+ jurisdictions | Not viable for bulk. Paid plans start at GBP 2,250/yr |
| Cobalt Intelligence | 20 free lookups | All 50 states | Credit-based paid API. Gold standard but expensive |
| Apify "US Business Entity Search" | Pay-per-use | 34 state registries | Uses SIP Public Data Gateway. Most comprehensive |
## Daily Cron
The `pw-entity-cache-refresh` timer runs daily at 07:00 UTC (2am CDT, 1am CST):
```
python -m scripts.formation.bulk_download --all
```
Downloads all configured Socrata states and upserts into `entity_cache`.
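The normalize-and-upsert step might be sketched as below. This is an illustration under stated assumptions, not the contents of `bulk_download`: the field names come from the Socrata table earlier in this document, the uniqueness key is assumed to be registering state plus filing number, and SQLite stands in for the real database so the example runs anywhere. Mapping raw status strings onto the canonical status values is left out because the per-state vocabularies are not documented here.

```python
import sqlite3
from typing import Dict, Iterable, Optional, Tuple

# (status field, formation-state field) per state, copied from the Socrata table.
SOCRATA_FIELDS: Dict[str, Tuple[Optional[str], str]] = {
    "CO": ("entitystatus", "jurisdictonofformation"),
    "IA": ("entity_status", "home_state"),
    "CT": ("status", "state_of_formation"),
    "OR": ("status", "state_of_origin"),
    "NY": (None, "jurisdiction"),
}


def status_and_formation(state: str, row: dict) -> Tuple[str, Optional[str]]:
    """Pull the two state-specific fields out of one raw Socrata row."""
    status_field, formation_field = SOCRATA_FIELDS[state]
    # NY publishes active entities only, so there is no status column to read.
    raw_status = row.get(status_field) if status_field else "ACTIVE"
    return (raw_status or "").strip().upper(), row.get(formation_field)


# Trimmed stand-in for the real table so the sketch runs on SQLite.
DEMO_DDL = """
CREATE TABLE entity_cache (
    entity_name TEXT NOT NULL,
    entity_number TEXT,
    status TEXT,
    formation_state TEXT,
    state TEXT NOT NULL,
    source TEXT DEFAULT 'socrata',
    UNIQUE (state, entity_number)
)
"""

UPSERT_SQL = """
INSERT INTO entity_cache
    (entity_name, entity_number, status, formation_state, state, source)
VALUES
    (:entity_name, :entity_number, :status, :formation_state, :state, 'socrata')
ON CONFLICT (state, entity_number) DO UPDATE SET
    entity_name = excluded.entity_name,
    status = excluded.status,
    formation_state = excluded.formation_state
"""


def upsert(conn: sqlite3.Connection, records: Iterable[dict]) -> None:
    """Insert new rows and refresh existing ones in a single transaction."""
    with conn:
        conn.executemany(UPSERT_SQL, list(records))
```

Running `upsert` twice with the same `(state, entity_number)` leaves one row carrying the latest status, which is what makes a daily full re-download safe to repeat.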
## Schema
```sql
-- entity_cache table (migration 009)
CREATE TABLE entity_cache (
    entity_name       TEXT NOT NULL,  -- Uppercase
    entity_number     TEXT,           -- State filing number
    entity_type       TEXT,           -- LLC, CORPORATION, LP, NONPROFIT
    status            TEXT,           -- ACTIVE, DISSOLVED, SUSPENDED, DELINQUENT, INACTIVE
    formation_date    DATE,
    formation_state   TEXT,           -- 2-letter code of state where entity was originally formed
    registered_agent  TEXT,
    principal_address TEXT,
    state             TEXT NOT NULL,  -- State this record is registered in
    source            TEXT DEFAULT 'socrata',
    UNIQUE (state, entity_number)     -- One row per registering state and filing number
);
-- Fuzzy search on names (needs the pg_trgm extension)
CREATE INDEX ON entity_cache USING gin (entity_name gin_trgm_ops);
CREATE INDEX ON entity_cache (state);
CREATE INDEX ON entity_cache (status);
```
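To show what the trigram index on `entity_name` buys, here is a pure-Python approximation of the similarity score that PostgreSQL's `pg_trgm` computes. It follows the documented rules (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space) but is only a sketch for intuition; the database's own `similarity()` is what the index actually serves.

```python
import re
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Return the set of 3-character windows over each padded word."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        return 0.0
    return len(ta & tb) / len(union)
```

Because punctuation and case are ignored, `"Acme LLC"` and `"ACME, LLC"` score as identical, while `"ACME LLC"` and `"ACME INC"` share only the trigrams from `acme`.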