new-site

Author	SHA1	Message	Date
justin	7670608c1a	fix(monitoring): render alertmanager.yml from template at deploy (fixes crash loop). Alertmanager does not expand ${ENV} in its YAML, so the committed config with ${TELEGRAM_BOT_TOKEN}/${TELEGRAM_CHAT_ID} crash-looped it (line 24: cannot unmarshal !!str `${TELEG...` into int64) - 11k+ restarts on prod, alerting dead. - rename alertmanager.yml -> alertmanager.yml.template (keeps ${} placeholders) - deploy.sh: envsubst the template into the (gitignored) alertmanager.yml from .env, scoped to the two TELEGRAM vars so the {{ }} Go-template message survives - gitignore the rendered file (contains the bot token) - warn if the vars are unset	2026-06-07 04:49:53 -05:00
justin	92427291e6	Fix ContainerHighMemory alert: skip containers with no memory limit. Containers without a memory limit have spec_memory_limit_bytes=0, causing division to produce +Inf, which always fires. Added guard: only alert when a limit is actually set (> 0). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:54:16 -05:00
justin	15f5c267e7	Fix dashboard stale series + enable Prometheus admin API. Dashboard queries now use max() to pick the UP value when old stale probe targets coexist with new ones. Prometheus admin API enabled for future TSDB cleanup of stale series. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:43:42 -05:00
justin	3194c71495	Fix Forgejo probe: use HTTPS public URL (port 3000 conflicts with Grafana internally) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:38:36 -05:00
justin	b190bcef92	Fix ERPNext and Forgejo probes - ERPNext: custom blackbox module with Host: performancewest.net header (ERPNext multitenancy requires site name in Host for routing) - Forgejo: add extra_hosts to blackbox-exporter so it can resolve host.docker.internal to reach forgejo on port 3000 - Blackbox http_erpnext module: sets Host header, expects 200 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:35:45 -05:00
justin	f856434642	Fix service probes: correct endpoints and permissive HTTP module - Workers: use http_internal module (HTTP/1.0 SimpleHTTPServer) - ERPNext: use /api/method/ping, accept 401/403 (still means alive) - Listmonk: use /health not /api/health (403 without auth) - Forgejo: port 3000 not 3030 - Dev API: probe via HTTPS public URL (blackbox can't reach Docker) - Added http_internal blackbox module accepting HTTP/1.0 + 401/403 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:33:48 -05:00
justin	2f9005693e	Add deep service health monitoring for all PW dependencies. Each service gets its own Prometheus probe verifying actual functionality: - API: /status endpoint (checks DB connectivity, returns 503 if down) - Workers: /health endpoint (job server responsive) - ERPNext: API method call (MariaDB + Redis + app all working) - MinIO: /minio/health/live (storage accessible) - Listmonk: /api/health (email service + DB) - Ollama: root endpoint (LLM inference available) - Umami: /api/heartbeat (analytics tracking) - Forgejo: root page (git server accessible) - PostgreSQL: pg_up metric from postgres-exporter - All HTTPS endpoints: SSL + reachability from outside. Service-specific alerts with context: - API down = DB may be unreachable - Workers down = compliance orders not processing - ERPNext down = CRM inaccessible - MinIO down = document storage unavailable. Custom Grafana dashboard: "Performance West — Services Overview" - Service status grid (UP/DOWN with colors) - Response time charts (internal + HTTPS) - SSL certificate expiry gauges - Container CPU/memory per service - PostgreSQL connections, nginx req/s, active alerts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:30:23 -05:00
justin	cc463a662f	Fix MinIO health probe: use internal Docker URL instead of public. MinIO returns 403 when accessed via minio.performancewest.net because it interprets the Host header as a bucket name. Switch blackbox probe to internal http://minio:9000/minio/health/live, which works correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:26:46 -05:00
justin	0a31313956	Fix nginx-exporter: back to bridge network with host.docker.internal. Host network mode prevented Prometheus from reaching the exporter. Switched back to bridge with extra_hosts + explicit port mapping. Added timeout flag to prevent hanging on stub_status fetch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:21:27 -05:00
justin	433827138b	Fix nginx-exporter: use host network mode for direct stub_status access. nginx-exporter couldn't reach host nginx via host.docker.internal (connection timeout). Switch to network_mode: host so it can access 127.0.0.1:8888 directly. Prometheus scrapes via host.docker.internal with extra_hosts mapping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:19:57 -05:00
justin	27cc925c4d	Fix nginx-exporter port and add alertmanager scrape target - nginx stub_status moved to port 8888 (port 80 was being caught by other server blocks and returning 301) - nginx-exporter updated to scrape :8888 - Added alertmanager scrape job to Prometheus config (was missing, so alertmanager dashboard had no data) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:17:31 -05:00
justin	b298ec12b7	Remove fixed uid from Grafana datasource provisioning — Grafana 13 rejects it on fresh boot Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:09:10 -05:00
justin	fc324cf7b9	Fix Grafana datasource UID to match dashboard references. Community dashboards reference datasource uid=prometheus but the auto-generated UID was random. Pin to uid=prometheus for compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 03:07:03 -05:00
justin	a4a5500bfc	Add Prometheus + Grafana + Alertmanager monitoring stack. Full observability stack with Telegram alerting. Components: - Prometheus: metrics collection, 90-day retention - Grafana: dashboards at monitoring.performancewest.net - Alertmanager: routes alerts to Telegram bot - node-exporter: OS metrics (CPU, RAM, disk, network) - cAdvisor: container metrics (CPU, memory, restarts) - postgres-exporter: PostgreSQL connection/query metrics - nginx-exporter: request rate, 5xx errors, connections - blackbox-exporter: HTTP/TCP endpoint probing + SSL cert checks. Alert rules: - Service down (HTTP probe, TCP port, container missing) - Container restart loops - High CPU/memory/disk/load - PostgreSQL down or high connections - SSL cert expiring (14d warning, 3d critical) - Slow HTTP responses, high 5xx rate. Blackbox probes all public endpoints: performancewest.net, api, dev, crm, lists, analytics, minio, crypto, pay. Telegram alerts: critical=1h repeat, warning=6h repeat, auto-resolve notifications. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 02:08:39 -05:00

14 commits
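
Note on 7670608c1a: the scoped substitution described in that commit can be sketched as below. This is only an illustration in Python, not the repository's deploy.sh (which, per the commit message, uses envsubst). The function name `render_template`, the `ALLOWED_VARS` tuple, and the sample template fields are hypothetical; only the two variable names come from the log. The point it shows is that restricting substitution to the two TELEGRAM variables leaves the `{{ }}` Go-template text and any other `${...}` strings untouched.

```python
import re

# Only these names are substituted (taken from the commit message).
# Everything else, including {{ }} Go-template syntax, passes through.
ALLOWED_VARS = ("TELEGRAM_BOT_TOKEN", "TELEGRAM_CHAT_ID")


def render_template(template: str, env: dict[str, str]) -> tuple[str, list[str]]:
    """Replace ${NAME} for allowed names only.

    Returns the rendered text and the list of allowed names that are
    unset or empty, so the caller can print a warning.
    """
    missing = [name for name in ALLOWED_VARS if not env.get(name)]
    pattern = re.compile(r"\$\{(" + "|".join(ALLOWED_VARS) + r")\}")
    # An unset variable becomes an empty string, as envsubst would do.
    rendered = pattern.sub(lambda m: env.get(m.group(1), ""), template)
    return rendered, missing


# Hypothetical template fragment, for illustration only.
SAMPLE_TEMPLATE = (
    "bot_token: '${TELEGRAM_BOT_TOKEN}'\n"
    "chat_id: ${TELEGRAM_CHAT_ID}\n"
    "message: '{{ .CommonAnnotations.summary }} ${OTHER}'\n"
)

if __name__ == "__main__":
    text, unset = render_template(
        SAMPLE_TEMPLATE,
        {"TELEGRAM_BOT_TOKEN": "example-token", "TELEGRAM_CHAT_ID": "12345"},
    )
    if unset:
        print("warning: unset variables:", ", ".join(unset))
    print(text, end="")
```

Rendering `chat_id` to a bare number matters here because the crash reported in the commit came from a placeholder string landing in a field that expects an integer.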