Commit graph

2 commits

Author SHA1 Message Date
justin
15f5c267e7 Fix dashboard stale series + enable Prometheus admin API
Dashboard queries now use max() to pick the UP value when stale
probe targets coexist with new ones. The Prometheus admin API is
enabled for future TSDB cleanup of stale series.
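
To illustrate why max() resolves the stale-series problem, here is a minimal Python sketch of the same aggregation. It is not the dashboard query itself: the `service` and `instance` label names and the sample values are assumptions made for the example. A stale target reports 0 and a live one reports 1, so taking the maximum per service yields UP whenever any series for that service is UP.

```python
from collections import defaultdict


def service_status(samples):
    """Collapse several series per service into one value using max().

    `samples` is a list of (labels, value) pairs, where value is 1 (UP)
    or 0 (DOWN). The label names are hypothetical.
    """
    by_service = defaultdict(list)
    for labels, value in samples:
        by_service[labels["service"]].append(value)
    # max() returns 1 if at least one series for the service reports UP.
    return {service: max(values) for service, values in by_service.items()}


# Hypothetical data: a stale "api" target still reports 0 next to the new one.
samples = [
    ({"service": "api", "instance": "old-target"}, 0),
    ({"service": "api", "instance": "new-target"}, 1),
    ({"service": "minio", "instance": "new-target"}, 0),
]
status = service_status(samples)
```

One trade-off of this approach: max() also reports UP when only one of several live targets for a service is healthy, so it fits best where each service has a single current target.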

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:43:42 -05:00
justin
2f9005693e Add deep service health monitoring for all PW dependencies
Each service gets its own Prometheus probe verifying actual functionality:
- API: /status endpoint (checks DB connectivity, returns 503 if down)
- Workers: /health endpoint (job server responsive)
- ERPNext: API method call (MariaDB + Redis + app all working)
- MinIO: /minio/health/live (storage accessible)
- Listmonk: /api/health (email service + DB)
- Ollama: root endpoint (LLM inference available)
- Umami: /api/heartbeat (analytics tracking)
- Forgejo: root page (git server accessible)
- PostgreSQL: pg_up metric from postgres-exporter
- All HTTPS endpoints: SSL + reachability from outside
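
The per-service probes above can be sketched as a lookup from service name to health path plus a status-code check. This is only an illustration: the base URL, the dictionary keys, and the rule that any 2xx response counts as healthy are assumptions, not the committed probe configuration. Only services whose exact paths appear in the list are included.

```python
# Health paths taken from the list above; the dictionary keys are hypothetical.
HEALTH_PATHS = {
    "api": "/status",
    "workers": "/health",
    "minio": "/minio/health/live",
    "listmonk": "/api/health",
    "umami": "/api/heartbeat",
}


def probe_url(base_url, service):
    """Join a base URL (an assumed value) with the service's health path."""
    return base_url.rstrip("/") + HEALTH_PATHS[service]


def is_up(status_code):
    """Treat any 2xx response as healthy (assumed rule); 503 counts as down."""
    return 200 <= status_code < 300


# Hypothetical base URL, for illustration only.
api_probe = probe_url("http://api.internal:8000/", "api")
```

With this rule, the 503 that the API's /status endpoint returns when the database is down is classified as DOWN, which matches the behavior described for that probe.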

Service-specific alerts with context:
- API down = DB may be unreachable
- Workers down = compliance orders not processing
- ERPNext down = CRM inaccessible
- MinIO down = document storage unavailable
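
As a rough sketch of how the alert context above could be attached to a failing service, the mapping below pairs each service with its impact text. The function name, dictionary keys, and message format are hypothetical; the real alerts are presumably defined as Prometheus alerting rules rather than in Python.

```python
# Impact text taken from the list above; the keys are hypothetical.
ALERT_CONTEXT = {
    "api": "DB may be unreachable",
    "workers": "compliance orders not processing",
    "erpnext": "CRM inaccessible",
    "minio": "document storage unavailable",
}


def firing_alerts(status):
    """Return one message per service whose status value is 0 (DOWN).

    `status` maps service name to 1 (UP) or 0 (DOWN). Services without
    a context entry get a generic message.
    """
    messages = []
    for service in sorted(status):
        if status[service] == 0:
            context = ALERT_CONTEXT.get(service, "no context defined")
            messages.append(f"{service} down: {context}")
    return messages


alerts = firing_alerts({"api": 1, "workers": 0, "minio": 0})
```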

Custom Grafana dashboard: "Performance West — Services Overview"
- Service status grid (UP/DOWN with colors)
- Response time charts (internal + HTTPS)
- SSL certificate expiry gauges
- Container CPU/memory per service
- PostgreSQL connections, nginx req/s, active alerts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:30:23 -05:00