crypto.performancewest.net kept going down because the shkeeper-deployment web pod periodically hangs: the HTTP server deadlocks while the apscheduler background thread keeps the process alive. The helm chart (shkeeper-1.7.15) ships no liveness or readiness probe, so Kubernetes saw the hung pod as Running, never restarted it, and kept routing traffic to the dead backend, leaving the site down until a manual restart. Added HTTP probes on / at port 5000 (a 302 response counts as healthy): liveness auto-restarts a hung pod, readiness pulls it from the Service endpoints. Applied live via kubectl patch (the chart does not expose probes via values; re-apply after any helm upgrade, using the command in the file header). Verified: the new pod comes up READY 1/1 (probes pass) and crypto.performancewest.net serves 302 again.
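The "302 = healthy" shorthand relies on the kubelet's rule for httpGet probes: any response status from 200 through 399 counts as success. A minimal shell sketch of that rule follows. The function name `probe_status_ok` is hypothetical, and the commented curl line assumes it runs from somewhere that can reach the pod on port 5000 (`POD_IP` is a placeholder, not something from this repo).

```shell
#!/bin/sh
# Hypothetical helper: mirrors the kubelet httpGet success rule
# (status >= 200 and < 400 passes; anything else fails).
probe_status_ok() {
    code="$1"
    # Reject empty or non-numeric input outright.
    case "$code" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # curl reports 000 when no HTTP response arrived (hang, refused, timeout),
    # which falls below 200 and therefore fails here.
    [ "$code" -ge 200 ] && [ "$code" -lt 400 ]
}

# Example use against a live pod (POD_IP is a placeholder):
# code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "http://$POD_IP:5000/")
# probe_status_ok "$code" && echo healthy || echo unhealthy
```

Under this rule the 302 that SHKeeper returns on `/` passes, while the 000 that curl reported during the outage would fail.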
41 lines
1.6 KiB
YAML
# SHKeeper (crypto.performancewest.net) liveness + readiness probes.
#
# WHY: the shkeeper-deployment web pod (Flask/apscheduler) periodically HANGS --
# the HTTP server stops responding while the background apscheduler thread keeps
# the process alive. With NO liveness probe (the chart ships none), Kubernetes
# saw the pod as "Running" and never restarted it, and with no readiness probe
# the hung pod stayed in the Service endpoints -> crypto.performancewest.net 000
# until someone manually restarted the deployment. This is why it "kept going
# down". (Diagnosed 2026-06-09: pod Running 1/1 but HTTP to :30723 and even the
# in-cluster svc returned 000/hung.)
#
# FIX: HTTP probes on / :5000 (returns 302 = healthy). Liveness auto-restarts a
# hung pod; readiness pulls it from rotation. Chart shkeeper-1.7.15 does not
# expose probes via helm values, so this is applied as a kubectl strategic-merge
# patch (re-apply after any helm upgrade):
#
#   KUBECONFIG=/etc/rancher/k3s/k3s.yaml \
#     k3s kubectl patch deploy shkeeper-deployment -n shkeeper \
#     --patch-file infra/k8s/shkeeper-liveness-probes-patch.yaml
#
spec:
  template:
    spec:
      containers:
        - name: shkeeper
          livenessProbe:
            httpGet:
              path: /
              port: 5000
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 15
            timeoutSeconds: 8
            failureThreshold: 2
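The probe timings in this patch determine roughly how long a hung pod can linger before Kubernetes reacts. As a back-of-the-envelope sketch: a hang that begins just after a successful probe goes unnoticed for up to `failureThreshold` probe periods, plus one `timeoutSeconds` for the final probe to time out. The function name `detection_window` is hypothetical, and the formula is an approximation that ignores kubelet scheduling jitter and the container's termination grace period.

```shell
#!/bin/sh
# Hypothetical helper: approximate worst-case seconds between a hang and the
# probe being declared failed.
#   $1 = periodSeconds, $2 = failureThreshold, $3 = timeoutSeconds
detection_window() {
    period="$1"
    threshold="$2"
    timeout="$3"
    # threshold consecutive probes, one per period, each timing out.
    echo $(( period * threshold + timeout ))
}

# Values taken from the liveness and readiness probes in this patch.
echo "liveness:  ~$(detection_window 30 3 10)s until a restart is triggered"
echo "readiness: ~$(detection_window 15 2 8)s until removal from endpoints"
```

With these settings the estimate works out to about 100 seconds for liveness and about 38 seconds for readiness, so a hung pod should leave the Service endpoints well before it is restarted.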