From 9b9d3179161a343d462cccb342b2242888ef8521 Mon Sep 17 00:00:00 2001 From: justin Date: Tue, 9 Jun 2026 04:57:50 -0500 Subject: [PATCH] infra/k8s: shkeeper liveness+readiness probes (fix recurring crypto.performancewest.net downtime) crypto.performancewest.net kept going down because the shkeeper-deployment web pod periodically HANGS (HTTP server deadlocks while the apscheduler background thread keeps the process alive). The helm chart (shkeeper-1.7.15) ships NO liveness or readiness probe, so k8s saw the hung pod as Running and never restarted it, and kept routing traffic to the dead backend -> site down until a manual restart. Added HTTP probes on / :5000 (302 = healthy): liveness auto-restarts a hung pod, readiness pulls it from the Service endpoints. Applied live via kubectl patch (chart does not expose probes via values; re-apply after any helm upgrade -- command in the file header). Verified: new pod comes up READY 1/1 (probe passes) and crypto.performancewest.net serves 302 again. --- infra/k8s/shkeeper-liveness-probes-patch.yaml | 41 +++++++++++++++++++ 1 file changed, 41 insertions(+) create mode 100644 infra/k8s/shkeeper-liveness-probes-patch.yaml diff --git a/infra/k8s/shkeeper-liveness-probes-patch.yaml b/infra/k8s/shkeeper-liveness-probes-patch.yaml new file mode 100644 index 0000000..2557eaa --- /dev/null +++ b/infra/k8s/shkeeper-liveness-probes-patch.yaml @@ -0,0 +1,41 @@ +# SHKeeper (crypto.performancewest.net) liveness + readiness probes. +# +# WHY: the shkeeper-deployment web pod (Flask/apscheduler) periodically HANGS -- +# the HTTP server stops responding while the background apscheduler thread keeps +# the process alive. With NO liveness probe (the chart ships none), Kubernetes +# saw the pod as "Running" and never restarted it, and with no readiness probe +# the hung pod stayed in the Service endpoints -> crypto.performancewest.net 000 +# until someone manually restarted the deployment. This is why it "kept going +# down". (Diagnosed 2026-06-09: pod Running 1/1 but HTTP to :30723 and even the +# in-cluster svc returned 000/hung.) +# +# FIX: HTTP probes on / :5000 (returns 302 = healthy). Liveness auto-restarts a +# hung pod; readiness pulls it from rotation. Chart shkeeper-1.7.15 does not +# expose probes via helm values, so this is applied as a kubectl strategic-merge +# patch (re-apply after any helm upgrade): +# +# KUBECONFIG=/etc/rancher/k3s/k3s.yaml \ +# k3s kubectl patch deploy shkeeper-deployment -n shkeeper \ +# --patch-file infra/k8s/shkeeper-liveness-probes-patch.yaml +# +spec: + template: + spec: + containers: + - name: shkeeper + livenessProbe: + httpGet: + path: / + port: 5000 + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + readinessProbe: + httpGet: + path: / + port: 5000 + initialDelaySeconds: 15 + periodSeconds: 15 + timeoutSeconds: 8 + failureThreshold: 2