new-site/infra/monitoring
justin e318f12e36 infra: disk-space guardrail + Docker log rotation (prevent disk-full crash)
On 2026-06-27 the root filesystem (/) filled to 100% and sent Postgres into a
crash loop ('No space left on device'), taking down Listmonk mid-send. Cause:
an orphaned 15GB /tmp/forgejo-dump.zip left by an interrupted backup, plus
uncapped Docker json-file logs (the forgejo container log alone was ~1GB),
with no disk monitoring in place to warn first.

- pw-disk-space-alert.sh + cron (every 15m): Telegram warn at 80%, auto-reclaim
  build cache + orphaned forgejo dump at 88%. Silent when healthy.
- ansible docker role: write /etc/docker/daemon.json with 50m x 3 log cap
  (150MB/container max) + non-disruptive Reload docker handler.
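
The two bullets above can be sketched roughly as follows. This is an
illustrative sketch, not the contents of pw-disk-space-alert.sh or the ansible
role: only the 80% / 88% thresholds and the 50m x 3 log cap come from the
commit message. The function names (disk_used_pct, classify_usage,
docker_log_cap_json) are hypothetical, and whether the real script also sends
a warning at the reclaim tier is an assumption left open here.

```shell
#!/bin/sh
# Sketch only: thresholds are from the commit message, all names are hypothetical.
WARN_PCT=80      # Telegram warning tier
RECLAIM_PCT=88   # auto-reclaim tier

# Print the used percentage of the filesystem holding path $1 as a bare integer.
# df -P forces the POSIX single-line format, so field 5 is always "NN%".
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage percentage to an action name: ok, warn, or reclaim.
classify_usage() {
    if [ "$1" -ge "$RECLAIM_PCT" ]; then
        echo reclaim
    elif [ "$1" -ge "$WARN_PCT" ]; then
        echo warn
    else
        echo ok
    fi
}

# Print a daemon.json body that caps json-file logs at 50m x 3 per container.
docker_log_cap_json() {
    cat <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "3" }
}
EOF
}
```

A cron wrapper would presumably call classify_usage "$(disk_used_pct /)" and
stay silent on ok, which matches the "Silent when healthy" behaviour. Note that
log-opts in daemon.json only applies to containers created after the change, so
existing containers keep their old log settings until they are recreated.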
2026-06-27 09:47:29 -05:00
pw-disk-space-alert.sh infra: disk-space guardrail + Docker log rotation (prevent disk-full crash) 2026-06-27 09:47:29 -05:00
pw-warmup-tg-alert.sh fix(monitoring): repair both dead mail-alert crons + de-noise DMARC digest 2026-06-24 06:28:50 -05:00