Commit graph

10 commits

Author SHA1 Message Date
justin
b190bcef92 Fix ERPNext and Forgejo probes
- ERPNext: custom blackbox module with Host: performancewest.net header
  (ERPNext multitenancy requires site name in Host for routing)
- Forgejo: add extra_hosts to blackbox-exporter so it can resolve
  host.docker.internal to reach forgejo on port 3000
- Blackbox http_erpnext module: sets Host header, expects 200

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:35:45 -05:00
justin
f856434642 Fix service probes: correct endpoints and permissive HTTP module
- Workers: use http_internal module (HTTP/1.0 SimpleHTTPServer)
- ERPNext: use /api/method/ping, accept 401/403 (still means alive)
- Listmonk: use /health not /api/health (403 without auth)
- Forgejo: port 3000 not 3030
- Dev API: probe via HTTPS public URL (blackbox can't reach Docker)
- Added http_internal blackbox module accepting HTTP/1.0 + 401/403

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:33:48 -05:00
justin
2f9005693e Add deep service health monitoring for all PW dependencies
Each service gets its own Prometheus probe verifying actual functionality:
- API: /status endpoint (checks DB connectivity, returns 503 if down)
- Workers: /health endpoint (job server responsive)
- ERPNext: API method call (MariaDB + Redis + app all working)
- MinIO: /minio/health/live (storage accessible)
- Listmonk: /api/health (email service + DB)
- Ollama: root endpoint (LLM inference available)
- Umami: /api/heartbeat (analytics tracking)
- Forgejo: root page (git server accessible)
- PostgreSQL: pg_up metric from postgres-exporter
- All HTTPS endpoints: SSL + reachability from outside

Service-specific alerts with context:
- API down = DB may be unreachable
- Workers down = compliance orders not processing
- ERPNext down = CRM inaccessible
- MinIO down = document storage unavailable

Custom Grafana dashboard: "Performance West — Services Overview"
- Service status grid (UP/DOWN with colors)
- Response time charts (internal + HTTPS)
- SSL certificate expiry gauges
- Container CPU/memory per service
- PostgreSQL connections, nginx req/s, active alerts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:30:23 -05:00
justin
cc463a662f Fix MinIO health probe: use internal Docker URL instead of public
MinIO returns 403 when accessed via minio.performancewest.net because
it interprets the Host header as a bucket name. Switch blackbox probe
to internal http://minio:9000/minio/health/live which works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:26:46 -05:00
justin
0a31313956 Fix nginx-exporter: back to bridge network with host.docker.internal
host network mode prevented Prometheus from reaching the exporter.
Switched back to bridge with extra_hosts + explicit port mapping.
Added timeout flag to prevent hanging on stub_status fetch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:21:27 -05:00
justin
433827138b Fix nginx-exporter: use host network mode for direct stub_status access
nginx-exporter couldn't reach host nginx via host.docker.internal
(connection timeout). Switch to network_mode: host so it can access
127.0.0.1:8888 directly. Prometheus scrapes via host.docker.internal
with extra_hosts mapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:19:57 -05:00
justin
27cc925c4d Fix nginx-exporter port and add alertmanager scrape target
- nginx stub_status moved to port 8888 (port 80 was being caught
  by other server blocks and returning 301)
- nginx-exporter updated to scrape :8888
- Added alertmanager scrape job to Prometheus config (was missing,
  so alertmanager dashboard had no data)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:17:31 -05:00
justin
b298ec12b7 Remove fixed uid from Grafana datasource provisioning — Grafana 13 rejects it on fresh boot
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:09:10 -05:00
justin
fc324cf7b9 Fix Grafana datasource UID to match dashboard references
Community dashboards reference datasource uid=prometheus but the
auto-generated UID was random. Pin to uid=prometheus for compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:07:03 -05:00
justin
a4a5500bfc Add Prometheus + Grafana + Alertmanager monitoring stack
Full observability stack with Telegram alerting:

Components:
- Prometheus: metrics collection, 90-day retention
- Grafana: dashboards at monitoring.performancewest.net
- Alertmanager: routes alerts to Telegram bot
- node-exporter: OS metrics (CPU, RAM, disk, network)
- cAdvisor: container metrics (CPU, memory, restarts)
- postgres-exporter: PostgreSQL connection/query metrics
- nginx-exporter: request rate, 5xx errors, connections
- blackbox-exporter: HTTP/TCP endpoint probing + SSL cert checks

Alert rules:
- Service down (HTTP probe, TCP port, container missing)
- Container restart loops
- High CPU/memory/disk/load
- PostgreSQL down or high connections
- SSL cert expiring (14d warning, 3d critical)
- Slow HTTP responses, high 5xx rate

Blackbox probes all public endpoints:
  performancewest.net, api, dev, crm, lists, analytics,
  minio, crypto, pay

Telegram alerts: critical=1h repeat, warning=6h repeat,
  auto-resolve notifications

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 02:08:39 -05:00