Commit graph

13 commits

Author SHA1 Message Date
justin
92427291e6 Fix ContainerHighMemory alert: skip containers with no memory limit
Containers without a memory limit report spec_memory_limit_bytes=0,
so the usage/limit division yields +Inf and the alert always fires.
Added a guard: only alert when a limit is actually set (> 0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:54:16 -05:00
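
The guard described in the commit above can be sketched in Python. This is an illustration only, not the actual alert rule (which is not shown in this log): the function names and the 0.9 threshold are assumptions. Note that Python raises ZeroDivisionError on division by zero, whereas the commit says the alert expression produced +Inf; in both cases the fix is to skip containers whose limit is 0.

```python
from typing import Optional


def memory_usage_ratio(usage_bytes: float, limit_bytes: float) -> Optional[float]:
    """Return usage/limit, or None when no memory limit is set (limit <= 0)."""
    if limit_bytes <= 0:
        # No limit configured: there is no meaningful ratio to alert on.
        return None
    return usage_bytes / limit_bytes


def should_alert(usage_bytes: float, limit_bytes: float, threshold: float = 0.9) -> bool:
    """Alert only when a limit exists and usage exceeds the (assumed) threshold."""
    ratio = memory_usage_ratio(usage_bytes, limit_bytes)
    return ratio is not None and ratio > threshold
```

With this guard, a container reporting a limit of 0 never alerts, regardless of how much memory it uses.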
justin
15f5c267e7 Fix dashboard stale series + enable Prometheus admin API
Dashboard queries now use max() to pick the UP value when stale
probe targets coexist with new ones. The Prometheus admin API is
enabled for future TSDB cleanup of stale series.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:43:42 -05:00
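
The effect of taking max() over coexisting series can be sketched as follows. The sample tuples, service names, and instance URLs are hypothetical; the real dashboard queries are not shown in this log. The idea is that a stale series stuck at 0 should not mask a current series reporting 1.

```python
from typing import Dict, Iterable, Tuple

# (service, instance, probe_success) -- hypothetical sample shape
Sample = Tuple[str, str, float]


def service_status(samples: Iterable[Sample]) -> Dict[str, float]:
    """Collapse all series per service to their maximum value.

    A service is UP (1.0) if any of its series reports success, so a
    stale target that still reports 0.0 cannot hide a healthy new one.
    """
    status: Dict[str, float] = {}
    for service, _instance, value in samples:
        status[service] = max(status.get(service, 0.0), value)
    return status


samples = [
    ("forgejo", "http://old-target.invalid:3030", 0.0),  # stale series
    ("forgejo", "https://new-target.invalid", 1.0),      # current series
    ("minio", "http://minio:9000/minio/health/live", 1.0),
]
```

One trade-off of this approach is that a service only shows DOWN when every one of its series reports 0, which is why the commit also enables the admin API for removing stale series later.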
justin
3194c71495 Fix Forgejo probe: use HTTPS public URL (port 3000 conflicts with Grafana internally)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:38:36 -05:00
justin
b190bcef92 Fix ERPNext and Forgejo probes
- ERPNext: custom blackbox module with Host: performancewest.net header
  (ERPNext multitenancy requires site name in Host for routing)
- Forgejo: add extra_hosts to blackbox-exporter so it can resolve
  host.docker.internal to reach forgejo on port 3000
- Blackbox http_erpnext module: sets Host header, expects 200

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:35:45 -05:00
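
To illustrate why the Host header matters for the ERPNext probe, here is a minimal Python sketch that builds (but does not send) a request with an explicit Host header. The internal URL is a placeholder assumption; only the Host value performancewest.net and the /api/method/ping path come from the commit messages in this log.

```python
import urllib.request

# Hypothetical internal address; the real one is not given in this log.
ERPNEXT_INTERNAL_URL = "http://erpnext.invalid:8000/api/method/ping"


def build_erpnext_probe(url: str = ERPNEXT_INTERNAL_URL,
                        site: str = "performancewest.net") -> urllib.request.Request:
    """Build a GET request that carries the site name in the Host header.

    The request targets an internal address, while the Host header tells
    a multitenant ERPNext which site should handle it.
    """
    return urllib.request.Request(url, headers={"Host": site}, method="GET")


request = build_erpnext_probe()
```

Passing the request to urllib.request.urlopen would send it with the overridden Host header; that step is left out here so the sketch runs without network access.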
justin
f856434642 Fix service probes: correct endpoints and permissive HTTP module
- Workers: use http_internal module (HTTP/1.0 SimpleHTTPServer)
- ERPNext: use /api/method/ping, accept 401/403 (still means alive)
- Listmonk: use /health not /api/health (403 without auth)
- Forgejo: port 3000 not 3030
- Dev API: probe via HTTPS public URL (blackbox can't reach Docker)
- Added http_internal blackbox module accepting HTTP/1.0 + 401/403

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:33:48 -05:00
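
The permissive matching that the http_internal module is described as doing can be sketched as a plain predicate. This is not the blackbox-exporter configuration itself. Treating 200 and HTTP/1.1 as accepted alongside the HTTP/1.0 and 401/403 values named in the commit is an assumption.

```python
# Assumption: 200 is accepted in addition to the 401/403 named in the commit.
ACCEPTED_STATUS = frozenset({200, 401, 403})
# Assumption: HTTP/1.1 stays accepted alongside HTTP/1.0.
ACCEPTED_VERSIONS = frozenset({"HTTP/1.0", "HTTP/1.1"})


def probe_succeeds(http_version: str, status_code: int) -> bool:
    """Treat the service as alive if it answered with an accepted
    protocol version and an accepted status code.

    A 401 or 403 still proves the application is up and routing
    requests; it only means the probe is unauthenticated.
    """
    return http_version in ACCEPTED_VERSIONS and status_code in ACCEPTED_STATUS
```

For example, an HTTP/1.0 response with status 403 counts as alive, while a 500 or a 301 does not.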
justin
2f9005693e Add deep service health monitoring for all PW dependencies
Each service gets its own Prometheus probe verifying actual functionality:
- API: /status endpoint (checks DB connectivity, returns 503 if down)
- Workers: /health endpoint (job server responsive)
- ERPNext: API method call (MariaDB + Redis + app all working)
- MinIO: /minio/health/live (storage accessible)
- Listmonk: /api/health (email service + DB)
- Ollama: root endpoint (LLM inference available)
- Umami: /api/heartbeat (analytics tracking)
- Forgejo: root page (git server accessible)
- PostgreSQL: pg_up metric from postgres-exporter
- All HTTPS endpoints: SSL + reachability from outside

Service-specific alerts with context:
- API down = DB may be unreachable
- Workers down = compliance orders not processing
- ERPNext down = CRM inaccessible
- MinIO down = document storage unavailable

Custom Grafana dashboard: "Performance West — Services Overview"
- Service status grid (UP/DOWN with colors)
- Response time charts (internal + HTTPS)
- SSL certificate expiry gauges
- Container CPU/memory per service
- PostgreSQL connections, nginx req/s, active alerts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:30:23 -05:00
justin
cc463a662f Fix MinIO health probe: use internal Docker URL instead of public
MinIO returns 403 when accessed via minio.performancewest.net because
it interprets the Host header as a bucket name. Switch blackbox probe
to internal http://minio:9000/minio/health/live which works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:26:46 -05:00
justin
0a31313956 Fix nginx-exporter: back to bridge network with host.docker.internal
Host network mode prevented Prometheus from reaching the exporter.
Switched back to bridge mode with extra_hosts and an explicit port
mapping. Added a timeout flag to prevent hanging on the stub_status fetch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:21:27 -05:00
justin
433827138b Fix nginx-exporter: use host network mode for direct stub_status access
nginx-exporter couldn't reach host nginx via host.docker.internal
(connection timeout). Switch to network_mode: host so it can access
127.0.0.1:8888 directly. Prometheus scrapes via host.docker.internal
with extra_hosts mapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:19:57 -05:00
justin
27cc925c4d Fix nginx-exporter port and add alertmanager scrape target
- nginx stub_status moved to port 8888 (port 80 was being caught
  by other server blocks and returning 301)
- nginx-exporter updated to scrape :8888
- Added alertmanager scrape job to Prometheus config (was missing,
  so alertmanager dashboard had no data)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:17:31 -05:00
justin
b298ec12b7 Remove fixed uid from Grafana datasource provisioning — Grafana 13 rejects it on fresh boot
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:09:10 -05:00
justin
fc324cf7b9 Fix Grafana datasource UID to match dashboard references
Community dashboards reference datasource uid=prometheus but the
auto-generated UID was random. Pin to uid=prometheus for compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 03:07:03 -05:00
justin
a4a5500bfc Add Prometheus + Grafana + Alertmanager monitoring stack
Full observability stack with Telegram alerting:

Components:
- Prometheus: metrics collection, 90-day retention
- Grafana: dashboards at monitoring.performancewest.net
- Alertmanager: routes alerts to Telegram bot
- node-exporter: OS metrics (CPU, RAM, disk, network)
- cAdvisor: container metrics (CPU, memory, restarts)
- postgres-exporter: PostgreSQL connection/query metrics
- nginx-exporter: request rate, 5xx errors, connections
- blackbox-exporter: HTTP/TCP endpoint probing + SSL cert checks

Alert rules:
- Service down (HTTP probe, TCP port, container missing)
- Container restart loops
- High CPU/memory/disk/load
- PostgreSQL down or high connections
- SSL cert expiring (14d warning, 3d critical)
- Slow HTTP responses, high 5xx rate

Blackbox probes all public endpoints:
  performancewest.net, api, dev, crm, lists, analytics,
  minio, crypto, pay

Telegram alerts: critical=1h repeat, warning=6h repeat,
  auto-resolve notifications

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 02:08:39 -05:00
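
The SSL expiry thresholds and Telegram repeat intervals listed in the commit above can be sketched as a small classifier. The function name, the strict less-than boundaries, and the use of None for "no alert" are assumptions; the actual alert rules are not shown in this log.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARNING_WINDOW = timedelta(days=14)   # from the commit: 14d warning
CRITICAL_WINDOW = timedelta(days=3)   # from the commit: 3d critical

# From the commit: critical repeats every 1h, warning every 6h.
REPEAT_INTERVAL = {
    "critical": timedelta(hours=1),
    "warning": timedelta(hours=6),
}


def cert_severity(not_after: datetime, now: datetime) -> Optional[str]:
    """Return "critical", "warning", or None based on time left until expiry.

    Boundaries are strict (assumed): exactly 14 days left does not warn.
    An already expired certificate is classified as critical.
    """
    remaining = not_after - now
    if remaining < CRITICAL_WINDOW:
        return "critical"
    if remaining < WARNING_WINDOW:
        return "warning"
    return None


now = datetime(2026, 5, 1, tzinfo=timezone.utc)
```

Calling cert_severity(now + timedelta(days=10), now) classifies the certificate as a warning, and REPEAT_INTERVAL then gives how often that severity would be re-sent.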