new-site/monitoring
justin 2f9005693e Add deep service health monitoring for all PW dependencies
Each service gets its own Prometheus probe verifying actual functionality:
- API: /status endpoint (checks DB connectivity, returns 503 if down)
- Workers: /health endpoint (job server responsive)
- ERPNext: API method call (MariaDB + Redis + app all working)
- MinIO: /minio/health/live (storage accessible)
- Listmonk: /api/health (email service + DB)
- Ollama: root endpoint (LLM inference available)
- Umami: /api/heartbeat (analytics tracking)
- Forgejo: root page (git server accessible)
- PostgreSQL: pg_up metric from postgres-exporter
- All HTTPS endpoints: SSL certificate validity + reachability from outside the host
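
As a rough illustration of how HTTP probes like the ones listed above are commonly wired up, here is a minimal sketch of a Prometheus scrape job that routes targets through blackbox-exporter. It is not taken from this repository's prometheus.yml. The job name, hostnames, ports, and exporter address are assumptions; only the health paths come from the list above, and the `http_2xx` module is assumed to be defined in blackbox.yml.

```yaml
# Sketch only: names, hosts, and ports below are assumptions, not the repo's values.
scrape_configs:
  - job_name: blackbox-internal            # hypothetical job name
    metrics_path: /probe                   # blackbox-exporter's probe endpoint
    params:
      module: [http_2xx]                   # assumed to exist in blackbox.yml
    static_configs:
      - targets:
          - http://minio:9000/minio/health/live   # assumed host and port
          - http://listmonk:9000/api/health       # assumed host and port
          - http://umami:3000/api/heartbeat       # assumed host and port
    relabel_configs:
      # Pass the original target URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label for dashboards and alerts
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the exporter instead of the target itself
      - target_label: __address__
        replacement: blackbox-exporter:9115       # assumed exporter address
```

With this pattern each probed URL yields a `probe_success` series (1 or 0), while PostgreSQL is covered separately by the `pg_up` metric from postgres-exporter.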

Service-specific alerts with context:
- API down = DB may be unreachable
- Workers down = compliance orders not processing
- ERPNext down = CRM inaccessible
- MinIO down = document storage unavailable
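
The alert context above could be expressed along these lines. This is a hedged sketch, not the contents of alert_rules.yml: the group name, alert names, `instance` matchers, `for` durations, and severity labels are all assumptions, and it presumes probes export `probe_success` through blackbox-exporter.

```yaml
# Sketch only: rule names, matchers, and thresholds are assumptions.
groups:
  - name: pw-service-health                # hypothetical group name
    rules:
      - alert: ApiDown                     # hypothetical alert name
        expr: probe_success{instance=~".*/status"} == 0
        for: 2m                            # assumed grace period
        labels:
          severity: critical
        annotations:
          summary: "API /status probe is failing"
          description: "The API is down; the database may be unreachable."
      - alert: MinioDown                   # hypothetical alert name
        expr: probe_success{instance=~".*/minio/health/live"} == 0
        for: 2m                            # assumed grace period
        labels:
          severity: critical
        annotations:
          summary: "MinIO liveness probe is failing"
          description: "Document storage is unavailable."
```

Putting the likely impact in the `description` annotation means the notification itself tells the on-call person what is affected, without a dashboard lookup.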

Custom Grafana dashboard: "Performance West — Services Overview"
- Service status grid (UP/DOWN with colors)
- Response time charts (internal + HTTPS)
- SSL certificate expiry gauges
- Container CPU/memory per service
- PostgreSQL connections, nginx req/s, active alerts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:30:23 -05:00
alert_rules.yml Add deep service health monitoring for all PW dependencies 2026-05-01 03:30:23 -05:00
alertmanager.yml Add Prometheus + Grafana + Alertmanager monitoring stack 2026-05-01 02:08:39 -05:00
blackbox.yml Add Prometheus + Grafana + Alertmanager monitoring stack 2026-05-01 02:08:39 -05:00
grafana-datasources.yml Remove fixed uid from Grafana datasource provisioning — Grafana 13 rejects it on fresh boot 2026-05-01 03:09:10 -05:00
prometheus.yml Add deep service health monitoring for all PW dependencies 2026-05-01 03:30:23 -05:00
pw-services-dashboard.json Add deep service health monitoring for all PW dependencies 2026-05-01 03:30:23 -05:00