new-site/docs/production-runbook.md
justin f8cd37ac8c Initial commit — Performance West telecom compliance platform
Includes: API (Express/TypeScript), Astro site, Python workers,
document generators, FCC compliance tools, Canada CRTC formation,
Ansible infrastructure, and deployment scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 06:54:22 -05:00


# Production Runbook — FCC Filing + Treasury Stack
This runbook covers what an operator has to provision before the FCC filing
automation and crypto-treasury pipeline can run in production. Each section
lists the specific env vars, portal credentials, and one-time setup steps.
---
## 1. Admin dashboard auth (Blocker 1)
The admin dashboard and every `/api/v1/admin/*` endpoint are guarded by a JWT
signed with `ADMIN_JWT_SECRET`. The API refuses to boot in production if the
secret is still the built-in placeholder.
### One-time setup
1. Generate a strong random secret:

       openssl rand -base64 48

2. Set on the API process (Docker / systemd env file):

       ADMIN_JWT_SECRET=<paste output>

3. Provision an admin user (the heredoc delimiter is quoted so the shell does
   not expand `$` or backticks inside the password):

       psql "$DATABASE_URL" <<'SQL'
       INSERT INTO admin_users (username, password_hash, display_name, email, active)
       VALUES (
         'justin',
         crypt('<strong-password>', gen_salt('bf', 12)),
         'Justin Tyson',
         'ops@performancewest.net',
         TRUE
       );
       SQL

   (Use `bcryptjs` from Node to hash if `pgcrypto` is unavailable.)

4. Verify login:

       curl -s -X POST https://api.performancewest.net/api/v1/admin/login \
         -H 'Content-Type: application/json' \
         -d '{"username":"justin","password":"<strong-password>"}'
### Related env vars
- `ADMIN_JWT_SECRET`: JWT signing secret. Required in production.
- `WEBHOOK_SECRET`: shared secret for ERPNext → API formation/CRTC webhooks.
- `SHKEEPER_API_KEY`: key SHKeeper sends in the `X-Shkeeper-Api-Key` header to authenticate its callback.
- `STRIPE_WEBHOOK_SECRET`: signing secret used to verify the HMAC signature on Stripe webhooks.
The startup guard in `api/src/config.ts` (`refuseInsecureProduction`) blocks
boot if any of the above are unset or still set to `change-this-in-production`.
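The same rule can be checked from an operator shell before a deploy. The sketch below is a Python illustration of the check described above, not the TypeScript guard in `api/src/config.ts`; the function names and the use of `NODE_ENV=production` to detect production are assumptions. Calling `refuse_insecure_production()` with no arguments inspects the current process environment.

```python
import os

PLACEHOLDER = "change-this-in-production"
REQUIRED_SECRETS = (
    "ADMIN_JWT_SECRET",
    "WEBHOOK_SECRET",
    "SHKEEPER_API_KEY",
    "STRIPE_WEBHOOK_SECRET",
)


def find_insecure_secrets(env):
    """Return the required secrets that are unset, empty, or still the placeholder."""
    return [
        name
        for name in REQUIRED_SECRETS
        if not env.get(name) or env[name] == PLACEHOLDER
    ]


def refuse_insecure_production(env=os.environ):
    """Raise when running in production with any insecure secret."""
    # Assumption: production is signalled by NODE_ENV=production.
    if env.get("NODE_ENV") != "production":
        return
    bad = find_insecure_secrets(env)
    if bad:
        raise RuntimeError("Refusing to boot; insecure secrets: " + ", ".join(bad))
```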
---
## 2. USAC E-File storage state (Blocker 2)
USAC's E-File portal (https://www2.usac.org/cr/) requires a logged-in session
cookie to submit Form 499-A. We drive it via Playwright. The filer's session
(login cookies + MFA state) must be provisioned once per filing entity.
### One-time setup per telecom entity
1. Log in manually to E-File using the entity's FRN + the assigned
E-File administrator account.
2. Complete MFA (USAC MFA is TOTP-based as of 2026).
3. Export the session state to MinIO:
bucket: `playwright-storage`
key: `usac/<telecom_entity_id>/storage_state.json`
The filer reads this key at the start of each `fcc-499a` /
`fcc-499-initial` job. If missing or expired, the handler logs a
ToDo for the admin.
4. Renewal: the USAC session expires after ~14 days idle. The filer re-uses it
as long as it's valid, and the scheduled `usac_session_refresh` cron (every
7 days) logs in again and re-exports it. The cron requires a stored TOTP
secret:
ERPNext Sensitive ID: `usac-totp-<telecom_entity_id>`
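To make the key layout and the "missing or expired" check concrete, here is a minimal sketch. It assumes the exported file is a standard Playwright storage state (a `cookies` list whose `expires` field is Unix seconds, with `-1` for session cookies); the function names are hypothetical and the filer's real check may differ.

```python
import time


def storage_state_key(telecom_entity_id):
    """Build the MinIO object key the filer reads for one entity."""
    return f"usac/{telecom_entity_id}/storage_state.json"


def has_expired_cookie(state, now=None):
    """True when any persistent cookie in a Playwright storage state has expired.

    Session cookies (expires == -1) carry no expiry and are skipped.
    """
    now = time.time() if now is None else now
    return any(
        0 <= cookie.get("expires", -1) <= now
        for cookie in state.get("cookies", [])
    )
```

This only catches cookies that have expired by their own timestamp. A session that USAC has invalidated server-side after the idle period still looks valid locally, which is one reason the scheduled refresh matters.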
### Env vars
- `PLAYWRIGHT_STORAGE_BUCKET=playwright-storage`
- `USAC_MFA_VIA=totp` (alternative: `sms` — not supported in automation)
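For reference, this is roughly how a stored TOTP secret becomes a login code. The sketch implements RFC 6238 with the Python standard library; the HMAC-SHA1, 30-second step, and 6-digit parameters are the common defaults and are assumed here, not confirmed for USAC.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, now=None, step=30, digits=6):
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) from a base32 secret."""
    now = time.time() if now is None else now
    cleaned = secret_b32.upper().replace(" ", "")
    cleaned += "=" * (-len(cleaned) % 8)  # restore stripped base32 padding
    key = base64.b32decode(cleaned)
    counter = struct.pack(">Q", int(now // step))
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation (RFC 4226)
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)
```

The code depends on the host clock, so the machine running `usac_session_refresh` should be time-synchronised.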
### Related docs
- See `scripts/workers/services/form_499a.py` for the filer entry point.
- See `docs/fcc-references/499a-filing.md` for screen-by-screen form notes.
---
## 3. Relay debit card (Blocker 4)
Filing portal charges settle on `RELAY_FILING_CARD_ID` — a Relay debit card
whose balance is the Relay business account balance. Once Bridge offramps
crypto USD to Relay, the same balance funds the card.
### One-time setup
1. In the Relay dashboard → Cards → Issue card.
2. Virtual, unlimited (no per-transaction cap); lock to "Online purchases only".
3. Whitelist MCCs 9399 (government services) and 7372 (computer services).
4. Copy the card's internal id from Relay (visible in the URL of the card detail
   page) and set:

       RELAY_FILING_CARD_ID=<card-id>

5. Fallback chain in `scripts/workers/relay_integration.py`:

       CRYPTO_FILING_CARD_ID → STRIPE_FILING_CARD_ID →
       PAYPAL_FILING_CARD_ID → RELAY_FILING_CARD_ID

   For crypto-funded orders, set `PREFERRED_FUNDING_CARD=RELAY_FILING_CARD_ID`
   so the Playwright filer charges Relay first.
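The selection order can be sketched as follows. This is an illustration of the chain described in step 5, not the code in `relay_integration.py`; in particular, treating `PREFERRED_FUNDING_CARD` as "try this variable first, then fall through the chain" is an assumption.

```python
FALLBACK_CHAIN = (
    "CRYPTO_FILING_CARD_ID",
    "STRIPE_FILING_CARD_ID",
    "PAYPAL_FILING_CARD_ID",
    "RELAY_FILING_CARD_ID",
)


def pick_funding_card(env):
    """Return (env var name, card id) for the first configured card."""
    order = list(FALLBACK_CHAIN)
    preferred = env.get("PREFERRED_FUNDING_CARD")
    if preferred in order:
        # Assumption: the preferred variable is simply tried first.
        order.remove(preferred)
        order.insert(0, preferred)
    for name in order:
        card_id = env.get(name)
        if card_id:
            return name, card_id
    raise LookupError("no filing card configured")
```

With only `STRIPE_FILING_CARD_ID` and `RELAY_FILING_CARD_ID` set, the Stripe card wins unless `PREFERRED_FUNDING_CARD` names the Relay variable.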
### Statement reconciliation
- Daily: `scripts/workers/relay_deposit_monitor.py` parses Relay IMAP alerts
into `relay_deposits`. Offramp deposits have `source_kind='offramp_bridge'`;
vendor charges appear as outgoing card transactions.
- Monthly: export Relay statement CSV, import into `bookkeeping/imports/`, and
reconcile against `filing_fee_reservations.status='spent'` rows.
---
## 4. Webhook → worker dispatch chain
Confirmed wiring as of this commit:
1. `POST /api/v1/webhooks/stripe` → verifies Stripe HMAC →
`handlePaymentComplete(order_id, order_type, session_id)`.
2. `POST /api/v1/webhooks/shkeeper` → verifies `X-Shkeeper-Api-Key` →
enqueues `crypto_payment_jobs` + calls `handlePaymentComplete`.
3. For compliance orders, `handlePaymentComplete`:
- Flips ERPNext Sales Order `workflow_state` to `Service Queued`.
- **Dispatches directly to the worker** at `${WORKER_URL}/jobs` with
`action=process_compliance_service` (no dependency on an ERPNext
Webhook fixture).
4. `POST /api/v1/webhooks/service/queued` (ERPNext-driven) remains as a
backup path — if you configure a Frappe Webhook on Sales Order
`workflow_state → Service Queued`, it fires the same worker action.
5. Worker `job_server.py:748` `handle_process_compliance_service` routes
to the handler from `SERVICE_HANDLERS[service_slug]`.
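The routing in step 5 amounts to a dictionary lookup. The sketch below shows the shape only: the two slugs are the job names mentioned earlier in this runbook, while the handler bodies and the `service_slug` / `order_id` payload fields are hypothetical stand-ins for whatever `job_server.py` actually receives.

```python
def handle_fcc_499a(job):
    # Placeholder body; the real handler drives the E-File portal.
    return f"filed 499-A for {job['order_id']}"


def handle_fcc_499_initial(job):
    # Placeholder body for the initial-registration flow.
    return f"filed 499 initial registration for {job['order_id']}"


SERVICE_HANDLERS = {
    "fcc-499a": handle_fcc_499a,
    "fcc-499-initial": handle_fcc_499_initial,
}


def handle_process_compliance_service(job):
    """Route a job to the handler registered for its service slug."""
    slug = job["service_slug"]
    handler = SERVICE_HANDLERS.get(slug)
    if handler is None:
        raise KeyError(f"no handler registered for service slug {slug!r}")
    return handler(job)
```

Failing loudly on an unknown slug keeps a mistyped or unregistered service from being silently dropped after the customer has paid.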
### Env vars
- `WORKER_URL=http://workers:8090` (internal Docker network name)
- `WEBHOOK_SECRET=<shared-with-ERPNext>`
- `SHKEEPER_API_KEY=<configured-in-SHKeeper-admin>`
- `STRIPE_WEBHOOK_SECRET=whsec_...` (from dashboard.stripe.com/webhooks)
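When a Stripe webhook is rejected, it helps to know what the signature check does with `STRIPE_WEBHOOK_SECRET`. Stripe signs the string `<timestamp>.<raw body>` with HMAC-SHA256 and sends the result in the `Stripe-Signature` header as `t=...,v1=...`. The sketch below reproduces that check with the standard library for debugging; it is not the API's own code, which presumably relies on Stripe's SDK, and the function name is hypothetical.

```python
import hashlib
import hmac
import time


def verify_stripe_signature(payload, header, secret, tolerance=300, now=None):
    """Check a Stripe-Signature header against the raw request body (bytes)."""
    now = time.time() if now is None else now
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit():
        return False
    if abs(now - int(timestamp)) > tolerance:
        return False  # stale or future-dated event: reject replays
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return any(hmac.compare_digest(expected, c) for c in candidates)
```

The check must run on the raw request bytes. A body that has been parsed and re-serialised as JSON will usually fail verification even with the right secret.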
### Verification
After deploying, confirm with:

    # trigger a compliance test checkout
    # then tail the API logs for these three lines per order:
    [checkout] Payment confirmed: compliance CO-xxx via <method>
    [checkout] Advanced compliance Sales Order SAL-xxx to Service Queued
    [checkout] Worker dispatched: CO-xxx (<service-slug>)
    # and the worker logs for:
    [worker] process_compliance_service: CO-xxx (<handler>)
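The API-side check can be scripted instead of read by eye. The sketch below takes captured API log text and reports which of the three expected lines is missing; the patterns are derived from the sample lines above, so they need adjusting if the real log format differs. The function name is hypothetical, and an empty result means all three lines were found. The second pattern cannot be tied to the order because that log line carries the Sales Order id, not the order id.

```python
import re

# "{order}" is a placeholder substituted with the escaped order id.
API_PATTERNS = (
    r"\[checkout\] Payment confirmed: compliance {order} via \S+",
    r"\[checkout\] Advanced compliance Sales Order \S+ to Service Queued",
    r"\[checkout\] Worker dispatched: {order} \(\S+\)",
)


def missing_api_log_lines(log_text, order_id):
    """Return the expected patterns that do not appear in the API log text."""
    order = re.escape(order_id)
    missing = []
    for template in API_PATTERNS:
        pattern = template.replace("{order}", order)
        if not re.search(pattern, log_text):
            missing.append(template)
    return missing
```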
---
## 5. Crypto treasury env (manual mode)
Until Bridge is approved, treasury runs in **manual** mode: an admin approves
every offramp before it touches Bridge.

    # default; flip to "auto" when Bridge is live
    CRYPTO_TREASURY_MODE=manual
    # Bridge (when approved):
    BRIDGE_API_KEY=
    BRIDGE_API_URL=https://api.bridge.xyz
    BRIDGE_RELAY_EXTERNAL_ACCOUNT_ID=
    BRIDGE_DEVELOPER_FEE_USD=0
    RELAY_BANK_MEMO_PREFIX=PW-ORDER-
    MAX_SLIPPAGE_BPS=300
    # Cold wallet (Bridge approval not required to sweep; hardware wallet is live)
    COLD_WALLET_BTC_ADDR=
    COLD_WALLET_ETH_ADDR=
    COLD_WALLET_USDC_ADDR=
    COLD_WALLET_USDT_ADDR=
    COLD_WALLET_HOT_FLOAT_USD_CENTS=50000
    COLD_WALLET_AUTO_SWEEP_CEILING_USD_CENTS=500000
    CRYPTO_SWEEP_ADMIN_EMAIL=ops@performancewest.net
In manual mode the `crypto_payment_worker` parks every `received` job at
`state='manual'` and an admin approves via
`POST /api/v1/admin/crypto-payments/:order_id/retry-offramp`.
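For operators who approve from a script and not from the dashboard, the request can be built as below. Only the path and method come from this runbook; sending the admin JWT as an `Authorization: Bearer` header and sending no request body are assumptions to verify against the API. The function only builds the request; pass the result to `urllib.request.urlopen` to send it.

```python
import urllib.parse
import urllib.request


def build_retry_offramp_request(api_base, order_id, admin_jwt):
    """Build (but do not send) the manual-approval request for one order."""
    path = "/api/v1/admin/crypto-payments/{}/retry-offramp".format(
        urllib.parse.quote(order_id, safe="")
    )
    return urllib.request.Request(
        api_base.rstrip("/") + path,
        method="POST",
        # Assumed auth scheme: admin JWT as a bearer token.
        headers={"Authorization": f"Bearer {admin_jwt}"},
    )
```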
---
## 6. Scheduled worker jobs (systemd timers)
Deployed by the `worker-crons` ansible role
(`infra/ansible/roles/worker-crons/`). Each timer runs
`docker compose exec -T workers python -m <module>` on its schedule.
| Timer | Cadence | Module |
|---|---|---|
| `pw-usf-factor-monitor.timer` | daily 09:00 CT | `scripts.workers.usf_factor_monitor` |
| `pw-deminimis-factor-check.timer` | daily 03:00 UTC | `scripts.workers.deminimis_factor_check` |
| `pw-cold-wallet-sweep.timer` | every 30 min | `scripts.workers.cold_wallet_sweeper` |
| `pw-crypto-payment-worker.timer` | every 60 s | `scripts.workers.crypto_payment_worker` |
| `pw-relay-deposit-monitor.timer` | every 5 min | `scripts.workers.relay_deposit_monitor` |
| `pw-commission-worker.timer` | daily 02:00 UTC | `scripts.workers.commission_worker` |
| `pw-renewal-worker.timer` | daily 04:00 UTC | `scripts.workers.renewal_worker` |
| `pw-cdr-retention.timer` | daily 05:00 UTC | `scripts.workers.cdr_retention_sweeper` |
| `pw-cdr-unlock-nudge.timer` | daily 10:00 CT | `scripts.workers.cdr_unlock_nudge` |
| `pw-payment-reminder.timer` | daily 11:00 CT | `scripts.workers.payment_reminder` |
| `pw-fcc-rmd-removed.timer` | weekly Wed 08:00 CT | `scripts.workers.fcc_rmd_removed_scraper` |
### Verification
    # list active timers
    systemctl list-timers 'pw-*'

    # tail a specific job's history
    journalctl -u pw-usf-factor-monitor.service --since '1 day ago'

    # trigger a job ad-hoc for testing
    systemctl start pw-deminimis-factor-check.service
### Adding a new cron
Add an entry to `infra/ansible/roles/worker-crons/defaults/main.yml`:
```yaml
- name: pw-my-new-job
  description: What it does
  module: scripts.workers.my_new_job
  on_calendar: "*-*-* 06:00:00 UTC"
  persistent: true # run on boot if missed
```
Then re-run `ansible-playbook playbooks/site.yml`.
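To show how such an entry maps onto systemd, the sketch below renders a timer unit and the worker command from one entry. `OnCalendar=` and `Persistent=` are standard `systemd.timer` directives, but the exact unit text the `worker-crons` role templates is not shown in this runbook, so treat the layout and the function names as assumptions.

```python
def render_timer_unit(entry):
    """Render a systemd .timer unit from one worker-crons entry."""
    persistent = "true" if entry.get("persistent") else "false"
    return "\n".join([
        "[Unit]",
        f"Description={entry['description']}",
        "",
        "[Timer]",
        f"OnCalendar={entry['on_calendar']}",
        f"Persistent={persistent}",
        "",
        "[Install]",
        "WantedBy=timers.target",
        "",
    ])


def worker_command(entry):
    """Command the matching .service runs, following the pattern in this section."""
    return f"docker compose exec -T workers python -m {entry['module']}"
```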
---
## 7. Smoke tests
Run before every release:

    # Service handler registry + CPNI/CALEA variant mapping
    docker compose exec workers python -m scripts.tests.test_cpni_calea_variants

    # Form 499 Initial handler guards
    docker compose exec workers python -m scripts.tests.test_form_499_initial_smoke

Both exit 0 on pass. Wire them into CI, adding `-T` to `docker compose exec`
because CI runners usually have no TTY.
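One way to wire the two smoke tests into CI is a small wrapper that runs each module and reports failures. The sketch below assumes the CI job runs on a host where the `workers` container is already up; the function names are hypothetical. A CI step can fail the build by exiting non-zero whenever `run_smoke_tests()` returns a non-empty list.

```python
import subprocess

SMOKE_MODULES = (
    "scripts.tests.test_cpni_calea_variants",
    "scripts.tests.test_form_499_initial_smoke",
)


def smoke_command(module):
    """Build the docker compose command for one smoke-test module."""
    # -T disables TTY allocation, which CI runners typically lack.
    return ["docker", "compose", "exec", "-T", "workers", "python", "-m", module]


def run_smoke_tests(run=subprocess.call):
    """Run every smoke module; return the modules that exited non-zero."""
    return [m for m in SMOKE_MODULES if run(smoke_command(m)) != 0]
```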
---
## 8. Boot-time health checks
The API and worker services each expose:
- `GET /health` — returns 200 when config loaded + DB reachable.
- `GET /health/deep` — returns 200 only when ERPNext, MinIO, and the worker
message channel all respond.
Set these as the Docker HEALTHCHECK / K8s liveness probe so deploys fail fast
when secrets are missing.
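For a post-deploy check outside Docker or Kubernetes, both endpoints can be polled from a script. The sketch below uses only the standard library; the function names are hypothetical, and the base URL you pass in depends on where the service is reachable.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status for a GET, or None when the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # non-2xx answers still carry a status code
    except (urllib.error.URLError, OSError):
        return None


def failing_checks(base_url, fetch=http_status):
    """Return the health paths that did not answer 200."""
    paths = ("/health", "/health/deep")
    return [p for p in paths if fetch(base_url.rstrip("/") + p) != 200]
```

For example, `failing_checks("http://workers:8090")` returns an empty list when both worker endpoints answer 200, and `["/health/deep"]` when only a downstream dependency is unreachable.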