From 9b9d3179161a343d462cccb342b2242888ef8521 Mon Sep 17 00:00:00 2001
From: justin <justin@liquidator.optimal-reality.com>
Date: Tue, 9 Jun 2026 04:57:50 -0500
Subject: [PATCH] infra/k8s: shkeeper liveness+readiness probes (fix recurring
 crypto.performancewest.net downtime)

crypto.performancewest.net kept going down because the shkeeper-deployment web
pod periodically hangs: the HTTP server deadlocks while the apscheduler
background thread keeps the process alive. The helm chart (shkeeper-1.7.15)
ships no liveness or readiness probe, so k8s saw the hung pod as Running, never
restarted it, and kept routing traffic to the dead backend. The site stayed
down until a manual restart.

Added HTTP GET probes on path / port 5000 (302 = healthy): a failed liveness
probe restarts the hung container, a failed readiness probe pulls the pod from
the Service endpoints. Applied live via kubectl patch (the chart does not
expose probes via values; re-apply after any helm upgrade, using the command in
the file header). Verified: the new pod comes up READY 1/1 (probe passes) and
crypto.performancewest.net serves 302 again.
---
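Note on probe timing (not part of the commit): the sketch below estimates how
long a hang can last before each probe reacts, using the values in this patch.
It assumes the hang begins right after a successful probe and that every later
probe runs to its full timeout; kubelet scheduling jitter and the container
restart time itself are ignored, so treat the numbers as a rough upper bound.
worst_case_seconds is a hypothetical helper, not a kubectl feature.

```shell
#!/bin/sh
# Rough upper bound (seconds) between a hang and the probe being marked failed:
# failureThreshold consecutive probes, one per period, the last one timing out.
# Hypothetical helper for illustration only.
worst_case_seconds() {
    period=$1 timeout=$2 threshold=$3
    echo $(( period * threshold + timeout ))
}

# Values copied from the livenessProbe and readinessProbe in this patch.
echo "liveness:  $(worst_case_seconds 30 10 3)s until a restart is triggered"
echo "readiness: $(worst_case_seconds 15 8 2)s until removed from endpoints"
```

Under these assumptions the readiness probe takes the pod out of rotation well
before the liveness probe restarts the container, which is the intended order.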
 infra/k8s/shkeeper-liveness-probes-patch.yaml | 41 +++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 infra/k8s/shkeeper-liveness-probes-patch.yaml

diff --git a/infra/k8s/shkeeper-liveness-probes-patch.yaml b/infra/k8s/shkeeper-liveness-probes-patch.yaml
new file mode 100644
index 0000000..2557eaa
--- /dev/null
+++ b/infra/k8s/shkeeper-liveness-probes-patch.yaml
@@ -0,0 +1,41 @@
+# SHKeeper (crypto.performancewest.net) liveness + readiness probes.
+#
+# WHY: the shkeeper-deployment web pod (Flask/apscheduler) periodically HANGS --
+# the HTTP server stops responding while the background apscheduler thread keeps
+# the process alive. With NO liveness probe (the chart ships none), Kubernetes
+# saw the pod as "Running" and never restarted it, and with no readiness probe
+# the hung pod stayed in the Service endpoints -> crypto.performancewest.net 000
+# until someone manually restarted the deployment. This is why it "kept going
+# down". (Diagnosed 2026-06-09: pod Running 1/1 but HTTP to :30723 and even the
+# in-cluster svc returned 000/hung.)
+#
+# FIX: HTTP GET probes on path / port 5000 (302 = healthy). A failed liveness
+# probe restarts the hung container; a failed readiness probe removes the pod
+# from the Service endpoints. Chart shkeeper-1.7.15 does not expose probes via
+# helm values: apply as a strategic-merge patch, again after any helm upgrade:
+#
+#   KUBECONFIG=/etc/rancher/k3s/k3s.yaml \
+#   k3s kubectl patch deploy shkeeper-deployment -n shkeeper \
+#       --patch-file infra/k8s/shkeeper-liveness-probes-patch.yaml
+#
+spec:
+  template:
+    spec:
+      containers:
+        - name: shkeeper
+          livenessProbe:
+            httpGet:
+              path: /
+              port: 5000
+            initialDelaySeconds: 60
+            periodSeconds: 30
+            timeoutSeconds: 10
+            failureThreshold: 3
+          readinessProbe:
+            httpGet:
+              path: /
+              port: 5000
+            initialDelaySeconds: 15
+            periodSeconds: 15
+            timeoutSeconds: 8
+            failureThreshold: 2