infra/k8s: shkeeper liveness+readiness probes (fix recurring crypto.performancewest.net downtime)

crypto.performancewest.net kept going down because the shkeeper-deployment web pod periodically HANGS (HTTP server deadlocks while the apscheduler background thread keeps the process alive). The helm chart (shkeeper-1.7.15) ships NO liveness or readiness probe, so k8s saw the hung pod as Running and never restarted it, and kept routing traffic to the dead backend -> site down until a manual restart. Added HTTP probes on / :5000 (302 = healthy): liveness auto-restarts a hung pod, readiness pulls it from the Service endpoints. Applied live via kubectl patch (chart does not expose probes via values; re-apply after any helm upgrade -- command in the file header). Verified: new pod comes up READY 1/1 (probe passes) and crypto.performancewest.net serves 302 again.
2026-06-09 04:57:50 -05:00 · 2026-06-09 04:57:50 -05:00 · 9b9d317916
commit 9b9d317916
parent a308aeed6b
1 changed files with 41 additions and 0 deletions
--- a/infra/k8s/shkeeper-liveness-probes-patch.yaml
+++ b/infra/k8s/shkeeper-liveness-probes-patch.yaml
@ -0,0 +1,41 @@
+# SHKeeper (crypto.performancewest.net) liveness + readiness probes.
+#
+# WHY: the shkeeper-deployment web pod (Flask/apscheduler) periodically HANGS --
+# the HTTP server stops responding while the background apscheduler thread keeps
+# the process alive. With NO liveness probe (the chart ships none), Kubernetes
+# saw the pod as "Running" and never restarted it, and with no readiness probe
+# the hung pod stayed in the Service endpoints -> crypto.performancewest.net 000
+# until someone manually restarted the deployment. This is why it "kept going
+# down". (Diagnosed 2026-06-09: pod Running 1/1 but HTTP to :30723 and even the
+# in-cluster svc returned 000/hung.)
+#
+# FIX: HTTP probes on / :5000 (returns 302 = healthy). Liveness auto-restarts a
+# hung pod; readiness pulls it from rotation. Chart shkeeper-1.7.15 does not
+# expose probes via helm values, so this is applied as a kubectl strategic-merge
+# patch (re-apply after any helm upgrade):
+#
+#   KUBECONFIG=/etc/rancher/k3s/k3s.yaml \
+#   k3s kubectl patch deploy shkeeper-deployment -n shkeeper \
+#       --patch-file infra/k8s/shkeeper-liveness-probes-patch.yaml
+#
+spec:
+  template:
+    spec:
+      containers:
+        - name: shkeeper
+          livenessProbe:
+            httpGet:
+              path: /
+              port: 5000
+            initialDelaySeconds: 60
+            periodSeconds: 30
+            timeoutSeconds: 10
+            failureThreshold: 3
+          readinessProbe:
+            httpGet:
+              path: /
+              port: 5000
+            initialDelaySeconds: 15
+            periodSeconds: 15
+            timeoutSeconds: 8
+            failureThreshold: 2