new-site

Author	SHA1	Message	Date
justin	b190bcef92	Fix ERPNext and Forgejo probes - ERPNext: custom blackbox module with Host: performancewest.net header (ERPNext multitenancy requires site name in Host for routing) - Forgejo: add extra_hosts to blackbox-exporter so it can resolve host.docker.internal to reach forgejo on port 3000 - Blackbox http_erpnext module: sets Host header, expects 200 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:35:45 -05:00
justin	f856434642	Fix service probes: correct endpoints and permissive HTTP module - Workers: use http_internal module (HTTP/1.0 SimpleHTTPServer) - ERPNext: use /api/method/ping, accept 401/403 (still means alive) - Listmonk: use /health not /api/health (403 without auth) - Forgejo: port 3000 not 3030 - Dev API: probe via HTTPS public URL (blackbox can't reach Docker) - Added http_internal blackbox module accepting HTTP/1.0 + 401/403 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:33:48 -05:00
justin	2f9005693e	Add deep service health monitoring for all PW dependencies. Each service gets its own Prometheus probe verifying actual functionality: - API: /status endpoint (checks DB connectivity, returns 503 if down) - Workers: /health endpoint (job server responsive) - ERPNext: API method call (MariaDB + Redis + app all working) - MinIO: /minio/health/live (storage accessible) - Listmonk: /api/health (email service + DB) - Ollama: root endpoint (LLM inference available) - Umami: /api/heartbeat (analytics tracking) - Forgejo: root page (git server accessible) - PostgreSQL: pg_up metric from postgres-exporter - All HTTPS endpoints: SSL + reachability from outside. Service-specific alerts with context: - API down = DB may be unreachable - Workers down = compliance orders not processing - ERPNext down = CRM inaccessible - MinIO down = document storage unavailable. Custom Grafana dashboard: "Performance West — Services Overview" - Service status grid (UP/DOWN with colors) - Response time charts (internal + HTTPS) - SSL certificate expiry gauges - Container CPU/memory per service - PostgreSQL connections, nginx req/s, active alerts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:30:23 -05:00
justin	cc463a662f	Fix MinIO health probe: use internal Docker URL instead of public. MinIO returns 403 when accessed via minio.performancewest.net because it interprets the Host header as a bucket name. Switch blackbox probe to internal http://minio:9000/minio/health/live, which works correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:26:46 -05:00
justin	0a31313956	Fix nginx-exporter: back to bridge network with host.docker.internal. Host network mode prevented Prometheus from reaching the exporter. Switched back to bridge with extra_hosts + explicit port mapping. Added timeout flag to prevent hanging on stub_status fetch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:21:27 -05:00
justin	433827138b	Fix nginx-exporter: use host network mode for direct stub_status access. nginx-exporter couldn't reach host nginx via host.docker.internal (connection timeout). Switch to network_mode: host so it can access 127.0.0.1:8888 directly. Prometheus scrapes via host.docker.internal with extra_hosts mapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:19:57 -05:00
justin	27cc925c4d	Fix nginx-exporter port and add alertmanager scrape target - nginx stub_status moved to port 8888 (port 80 was being caught by other server blocks and returning 301) - nginx-exporter updated to scrape :8888 - Added alertmanager scrape job to Prometheus config (was missing, so alertmanager dashboard had no data) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:17:31 -05:00
justin	b298ec12b7	Remove fixed uid from Grafana datasource provisioning — Grafana 13 rejects it on fresh boot Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:09:10 -05:00
justin	fc324cf7b9	Fix Grafana datasource UID to match dashboard references. Community dashboards reference datasource uid=prometheus but the auto-generated UID was random. Pin to uid=prometheus for compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:07:03 -05:00
justin	a4a5500bfc	Add Prometheus + Grafana + Alertmanager monitoring stack. Full observability stack with Telegram alerting. Components: - Prometheus: metrics collection, 90-day retention - Grafana: dashboards at monitoring.performancewest.net - Alertmanager: routes alerts to Telegram bot - node-exporter: OS metrics (CPU, RAM, disk, network) - cAdvisor: container metrics (CPU, memory, restarts) - postgres-exporter: PostgreSQL connection/query metrics - nginx-exporter: request rate, 5xx errors, connections - blackbox-exporter: HTTP/TCP endpoint probing + SSL cert checks. Alert rules: - Service down (HTTP probe, TCP port, container missing) - Container restart loops - High CPU/memory/disk/load - PostgreSQL down or high connections - SSL cert expiring (14d warning, 3d critical) - Slow HTTP responses, high 5xx rate. Blackbox probes all public endpoints: performancewest.net, api, dev, crm, lists, analytics, minio, crypto, pay. Telegram alerts: critical=1h repeat, warning=6h repeat, auto-resolve notifications. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 02:08:39 -05:00

10 commits
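
The probe behaviour described in commits f856434642 and b190bcef92 (treat 401/403 as "still alive", and send an explicit Host header so ERPNext can route to the right site) can be illustrated with a small Python sketch. This is not the blackbox-exporter configuration from the repository; the helper names `build_probe` and `is_alive` and the internal URL are hypothetical and exist only for illustration.

```python
from urllib.request import Request

# Status codes treated as "service is alive". 401/403 mirror the
# "accept 401/403 (still means alive)" note in commit f856434642.
ALIVE_STATUSES = frozenset({200, 401, 403})


def build_probe(url, host_header=None):
    """Build a GET request, optionally overriding the Host header.

    Hypothetical helper: a multitenant backend may pick the site from the
    Host header, so a probe sent to an internal address needs the public
    site name set explicitly.
    """
    request = Request(url, method="GET")
    if host_header:
        request.add_header("Host", host_header)
    return request


def is_alive(status_code, accepted=ALIVE_STATUSES):
    """Return True when the status code counts as a healthy response."""
    return status_code in accepted


# Assumed internal address; the real one is not given in the commit log.
probe = build_probe(
    "http://erpnext.internal:8000/api/method/ping",
    host_header="performancewest.net",
)
```

Passing `probe` to `urllib.request.urlopen` would send the request; an `HTTPError` raised for a 401 or 403 carries the status in its `code` attribute, which can then be passed to `is_alive`.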