Alertmanager does not expand ${ENV} in its YAML, so the committed config with
${TELEGRAM_BOT_TOKEN}/${TELEGRAM_CHAT_ID} crash-looped it (line 24: cannot
unmarshal !!str `${TELEG...` into int64) - 11k+ restarts on prod, alerting dead.
- rename alertmanager.yml -> alertmanager.yml.template (keeps ${} placeholders)
- deploy.sh: envsubst the template into the (gitignored) alertmanager.yml from
.env, scoped to the two TELEGRAM vars so the {{ }} Go-template message survives
- gitignore the rendered file (contains the bot token)
- warns if the vars are unset
Containers without a memory limit have spec_memory_limit_bytes=0,
causing division to produce +Inf which always fires. Added guard:
only alert when a limit is actually set (> 0).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dashboard queries now use max() to pick UP value when old stale
probe targets coexist with new ones. Prometheus admin API enabled
for future TSDB cleanup of stale series.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ERPNext: custom blackbox module with Host: performancewest.net header
(ERPNext multitenancy requires site name in Host for routing)
- Forgejo: add extra_hosts to blackbox-exporter so it can resolve
host.docker.internal to reach forgejo on port 3000
- Blackbox http_erpnext module: sets Host header, expects 200
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Workers: use http_internal module (HTTP/1.0 SimpleHTTPServer)
- ERPNext: use /api/method/ping, accept 401/403 (still means alive)
- Listmonk: use /health not /api/health (403 without auth)
- Forgejo: port 3000 not 3030
- Dev API: probe via HTTPS public URL (blackbox can't reach Docker)
- Added http_internal blackbox module accepting HTTP/1.0 + 401/403
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each service gets its own Prometheus probe verifying actual functionality:
- API: /status endpoint (checks DB connectivity, returns 503 if down)
- Workers: /health endpoint (job server responsive)
- ERPNext: API method call (MariaDB + Redis + app all working)
- MinIO: /minio/health/live (storage accessible)
- Listmonk: /api/health (email service + DB)
- Ollama: root endpoint (LLM inference available)
- Umami: /api/heartbeat (analytics tracking)
- Forgejo: root page (git server accessible)
- PostgreSQL: pg_up metric from postgres-exporter
- All HTTPS endpoints: SSL + reachability from outside
Service-specific alerts with context:
- API down = DB may be unreachable
- Workers down = compliance orders not processing
- ERPNext down = CRM inaccessible
- MinIO down = document storage unavailable
Custom Grafana dashboard: "Performance West — Services Overview"
- Service status grid (UP/DOWN with colors)
- Response time charts (internal + HTTPS)
- SSL certificate expiry gauges
- Container CPU/memory per service
- PostgreSQL connections, nginx req/s, active alerts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MinIO returns 403 when accessed via minio.performancewest.net because
it interprets the Host header as a bucket name. Switch blackbox probe
to internal http://minio:9000/minio/health/live which works correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
host network mode prevented Prometheus from reaching the exporter.
Switched back to bridge with extra_hosts + explicit port mapping.
Added timeout flag to prevent hanging on stub_status fetch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nginx-exporter couldn't reach host nginx via host.docker.internal
(connection timeout). Switch to network_mode: host so it can access
127.0.0.1:8888 directly. Prometheus scrapes via host.docker.internal
with extra_hosts mapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- nginx stub_status moved to port 8888 (port 80 was being caught
by other server blocks and returning 301)
- nginx-exporter updated to scrape :8888
- Added alertmanager scrape job to Prometheus config (was missing,
so alertmanager dashboard had no data)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community dashboards reference datasource uid=prometheus but the
auto-generated UID was random. Pin to uid=prometheus for compatibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>