infra/k8s: shkeeper liveness+readiness probes (fix recurring crypto.performancewest.net downtime)
crypto.performancewest.net kept going down because the shkeeper-deployment web pod periodically hangs: the HTTP server deadlocks while the apscheduler background thread keeps the process alive. The helm chart (shkeeper-1.7.15) ships no liveness or readiness probe, so Kubernetes saw the hung pod as Running, never restarted it, and kept routing traffic to the dead backend. The site stayed down until a manual restart.

Added HTTP probes on / :5000 (302 = healthy): liveness auto-restarts a hung pod, readiness pulls it from the Service endpoints. Applied live via kubectl patch, since the chart does not expose probes via values; re-apply after any helm upgrade (command in the file header).

Verified: new pod comes up READY 1/1 (probe passes) and crypto.performancewest.net serves 302 again.
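To illustrate why a 302 on / counts as healthy while a hung server does not, here is a minimal Python sketch of an HTTP probe check. It relies on the documented Kubernetes rule that any status from 200 up to (but not including) 400 is a success. The names `status_is_healthy` and `http_probe` are hypothetical, and the sketch does not model the kubelet's redirect handling: it reports the raw status of the first response.

```python
import http.client


def status_is_healthy(status: int) -> bool:
    """Kubernetes treats any HTTP status in [200, 400) as a passing probe."""
    return 200 <= status < 400


def http_probe(host: str, port: int, path: str = "/", timeout: float = 10.0) -> bool:
    """Hypothetical stand-in for one probe attempt.

    Returns False on connection errors, timeouts (the hung-server case),
    and malformed responses, as well as on 4xx/5xx statuses.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        # http.client does not follow redirects, so a 302 is seen as-is.
        return status_is_healthy(conn.getresponse().status)
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

A process that is still alive but never answers, which is the failure described above, makes `http_probe` time out and return False even though the container would still be reported as Running.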
parent a308aeed6b
commit 9b9d317916
1 changed file with 41 additions and 0 deletions
41
infra/k8s/shkeeper-liveness-probes-patch.yaml
Normal file
@@ -0,0 +1,41 @@
# SHKeeper (crypto.performancewest.net) liveness + readiness probes.
#
# WHY: the shkeeper-deployment web pod (Flask/apscheduler) periodically HANGS --
# the HTTP server stops responding while the background apscheduler thread keeps
# the process alive. With NO liveness probe (the chart ships none), Kubernetes
# saw the pod as "Running" and never restarted it, and with no readiness probe
# the hung pod stayed in the Service endpoints -> crypto.performancewest.net 000
# until someone manually restarted the deployment. This is why it "kept going
# down". (Diagnosed 2026-06-09: pod Running 1/1 but HTTP to :30723 and even the
# in-cluster svc returned 000/hung.)
#
# FIX: HTTP probes on / :5000 (returns 302 = healthy). Liveness auto-restarts a
# hung pod; readiness pulls it from rotation. Chart shkeeper-1.7.15 does not
# expose probes via helm values, so this is applied as a kubectl strategic-merge
# patch (re-apply after any helm upgrade):
#
#   KUBECONFIG=/etc/rancher/k3s/k3s.yaml \
#     k3s kubectl patch deploy shkeeper-deployment -n shkeeper \
#     --patch-file infra/k8s/shkeeper-liveness-probes-patch.yaml
#
spec:
  template:
    spec:
      containers:
        - name: shkeeper
          livenessProbe:
            httpGet:
              path: /
              port: 5000
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 15
            timeoutSeconds: 8
            failureThreshold: 2
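The probe timings above determine roughly how long a hang can go unnoticed. The Python sketch below works through that arithmetic under a simplified model, which is an assumption rather than the kubelet's exact behavior: the pod hangs just after a successful probe, every later probe runs to its full timeout, and kubelet jitter and container restart time are ignored. The function name `worst_case_detection_seconds` is hypothetical.

```python
def worst_case_detection_seconds(period: int, timeout: int, failure_threshold: int) -> int:
    """Approximate upper bound on time from hang to the final failed probe.

    Simplified model (assumption): the hang starts right after a passing
    probe, so the first failing probe begins one full period later, and
    the last of `failure_threshold` probes fails when its timeout expires.
    """
    return period * failure_threshold + timeout


# Values taken from the patch file.
liveness = worst_case_detection_seconds(period=30, timeout=10, failure_threshold=3)
readiness = worst_case_detection_seconds(period=15, timeout=8, failure_threshold=2)

print(f"liveness: restart triggered after at most ~{liveness}s")     # ~100s
print(f"readiness: removed from endpoints after at most ~{readiness}s")  # ~38s
```

Under this model the readiness probe takes the hung pod out of the Service endpoints well before the liveness probe restarts it, which matches the intent of using tighter readiness settings than liveness settings.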