new-site/infra/k8s/shkeeper-liveness-probes-patch.yaml
justin 9b9d317916 infra/k8s: shkeeper liveness+readiness probes (fix recurring crypto.performancewest.net downtime)
crypto.performancewest.net kept going down because the shkeeper-deployment web
pod periodically HANGS (HTTP server deadlocks while the apscheduler background
thread keeps the process alive). The helm chart (shkeeper-1.7.15) ships NO
liveness or readiness probe, so k8s saw the hung pod as Running and never
restarted it, and kept routing traffic to the dead backend -> site down until a
manual restart.

Added HTTP probes on / :5000 (302 = healthy): liveness auto-restarts a hung pod,
readiness pulls it from the Service endpoints. Applied live via kubectl patch
(chart does not expose probes via values; re-apply after any helm upgrade --
command in the file header). Verified: new pod comes up READY 1/1 (probe passes)
and crypto.performancewest.net serves 302 again.
2026-06-09 04:57:50 -05:00

# SHKeeper (crypto.performancewest.net) liveness + readiness probes.
#
# WHY: the shkeeper-deployment web pod (Flask/apscheduler) periodically HANGS --
# the HTTP server stops responding while the background apscheduler thread keeps
# the process alive. With NO liveness probe (the chart ships none), Kubernetes
# saw the pod as "Running" and never restarted it, and with no readiness probe
# the hung pod stayed in the Service endpoints -> crypto.performancewest.net 000
# until someone manually restarted the deployment. This is why it "kept going
# down". (Diagnosed 2026-06-09: pod Running 1/1 but HTTP to :30723 and even the
# in-cluster svc returned 000/hung.)
#
# FIX: HTTP probes on / :5000. The app answers 302 there, and an httpGet probe
# counts any status from 200 to 399 as success, so 302 = healthy. Liveness
# auto-restarts a hung pod; readiness pulls it from rotation. Chart
# shkeeper-1.7.15 does not expose probes via helm values, so this is applied as
# a kubectl strategic-merge patch (re-apply after any helm upgrade):
#
#   KUBECONFIG=/etc/rancher/k3s/k3s.yaml \
#     k3s kubectl patch deploy shkeeper-deployment -n shkeeper \
#     --patch-file infra/k8s/shkeeper-liveness-probes-patch.yaml
#
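# To check that the patch took effect, something like the following should
# work (a sketch using standard kubectl subcommands, run with the same
# KUBECONFIG as the patch command; not part of the original procedure):
#
#   k3s kubectl rollout status deploy/shkeeper-deployment -n shkeeper
#   k3s kubectl get deploy shkeeper-deployment -n shkeeper \
#     -o jsonpath='{.spec.template.spec.containers[?(@.name=="shkeeper")].livenessProbe}'
#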
spec:
  template:
    spec:
      containers:
        - name: shkeeper
          livenessProbe:
            httpGet:
              path: /
              port: 5000
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 15
            timeoutSeconds: 8
            failureThreshold: 2
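#
# Rough worst-case detection times implied by the probe values in this patch
# (a back-of-envelope estimate, assuming the server hangs right after a
# successful probe and every later probe fails by timing out):
#   liveness:  periodSeconds * failureThreshold + timeoutSeconds
#              = 30 * 3 + 10 = ~100s from hang to container restart
#   readiness: 15 * 2 + 8 = ~38s from hang to removal from Service endpoints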