docserver: self-healing Task Scheduler config + docs

Companion to the worker MinIO-retry fix. Makes the worker auto-recover from
process death (crash, manual kill, missed boot trigger), not just MinIO outages.

- start_worker.bat: propagate Python's exit code (exit /b %rc%) so Task
  Scheduler can actually detect a failed run (it previously always exited 0).
- reconfigure_task.ps1 (new): re-registers PW-DocserverWorker with
  RestartCount=99 / 1-min interval, StartWhenAvailable, and two triggers —
  AtStartup plus a 5-min repeating trigger with MultipleInstances=IgnoreNew, so
  a dead worker relaunches within ~5 min and never double-runs. Idempotent.
- install.ps1: same self-healing settings for fresh installs.
- Verified on the Windows VM: after killing the worker, the task relaunched it;
  firing the task again while the worker was running left a single instance.

Docs updated to match reality:
- docserver/README.md: new 'Reliability / self-healing' section.
- document-generation.md: corrected the stale 'Flask DocServer :5050 / HTTP'
  description to the actual MinIO outbound-only transport.
- e2e-test-plan.md: removed the outdated 'Word COM fails under SYSTEM / requires
  RDP after every reboot' limitation; now self-healing under SYSTEM session 0.
- infrastructure.md: fixed VM spec (Win Server 2019, Word 16.0, Python 3.13,
  SSH port 22422) + self-healing note.
- architecture.md / formation-system.md: trigger + self-healing details.
justin 2026-06-15 22:49:21 -05:00
parent 7929413eeb
commit b48d0cb799
9 changed files with 150 additions and 24 deletions


@@ -61,6 +61,32 @@ This will:
The worker must run as a **logged-in user** — Word COM requires an interactive
Windows session and will fail under a system service account.
## Reliability / self-healing

The worker is designed to recover from outages without manual intervention:

- **MinIO outages don't kill it.** The worker retries the MinIO connection
  indefinitely with capped exponential backoff (5s → 120s) instead of exiting,
  and each poll cycle is wrapped so a transient network error or 502 just
  rebuilds the client and keeps going. (Previously a single 502 made the worker
  call `sys.exit(1)`, leaving it dead until a reboot.)
- **Crashes / kills are auto-recovered by Task Scheduler.** The
  `PW-DocserverWorker` task has:
  - `RestartCount=99`, `RestartInterval=1 min`: relaunch if the action fails,
  - **two triggers**: `AtStartup` plus a **repeating trigger every 5 minutes**
    with `MultipleInstances=IgnoreNew`, so if the process ever dies (crash,
    manual kill, or a missed boot trigger) it relaunches within ~5 min and
    never runs more than one instance,
  - `StartWhenAvailable` to catch up on a missed trigger.
- `start_worker.bat` **propagates Python's exit code** (`exit /b %rc%`) so
  Task Scheduler can actually detect a failed run.

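The capped exponential backoff from the first bullet can be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: the doubling factor, the function names `backoff_delays` and `connect_with_retry`, and the stand-in `flaky_connect` are all assumptions; only the 5s start and 120s cap come from the description above.

```python
import itertools
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(initial: float = 5.0, cap: float = 120.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield retry delays that grow by `factor` and never exceed `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_retry(connect: Callable[[], T],
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds, sleeping between failed attempts."""
    for delay in backoff_delays():
        try:
            return connect()
        except Exception as exc:  # a real worker would catch narrower errors
            print(f"connect failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable: backoff_delays() never ends")


# Demo with a hypothetical connection that fails twice, then succeeds.
attempts = {"n": 0}


def flaky_connect() -> str:
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise ConnectionError("502 Bad Gateway")
    return "client"


slept: list[float] = []
# Record the delays instead of really sleeping, so the demo runs instantly.
client = connect_with_retry(flaky_connect, sleep=slept.append)
print(client, slept)
print(list(itertools.islice(backoff_delays(), 7)))
```

Passing `sleep` as a parameter keeps the retry loop testable without waiting; in a long-running worker the default `time.sleep` would be used.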
To re-apply these task settings on an existing install, run as Administrator:
```powershell
powershell -ExecutionPolicy Bypass -File C:\docserver\reconfigure_task.ps1
```
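The "~5 min" recovery bound quoted in this section follows from the repeating trigger's interval. A small arithmetic sketch, assuming an idealised schedule in which the trigger fires on a fixed 5-minute grid measured from task registration and a fire is ignored while the worker is still running (`MultipleInstances=IgnoreNew`). The function name and the seconds-based timeline are illustrative only, and the separate restart-on-failure path is not modelled:

```python
def relaunch_delay(death_s: float, interval_s: float = 300.0) -> float:
    """Seconds between the worker dying and the next repeating-trigger fire.

    Times are seconds since the task was registered; fires are assumed to
    land at interval_s, 2 * interval_s, ... (an idealised schedule).
    """
    # First fire strictly after the moment of death.
    next_fire = (death_s // interval_s + 1) * interval_s
    return next_fire - death_s


# A worker that dies just after a fire waits almost the full interval;
# one that dies just before a fire is relaunched almost immediately.
for death in (301.0, 450.0, 599.0):
    print(f"died at {death:.0f}s, relaunch after {relaunch_delay(death):.0f}s")
```

Under these assumptions the wait is always greater than zero and at most one full interval, which is where the 5-minute figure comes from.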
## How to access MinIO externally
The Windows VM needs to reach MinIO. Options:


@@ -166,11 +166,18 @@ $action = New-ScheduledTaskAction `
-Argument "/c `"$AppDir\start_worker.bat`"" `
-WorkingDirectory $AppDir
-$trigger = New-ScheduledTaskTrigger -AtStartup
+# Two triggers for self-healing: at boot, plus a repeating 5-minute safety net
+# that relaunches the worker if its process ever dies (crash, manual kill, or a
+# missed boot trigger). MultipleInstances=IgnoreNew keeps it to one instance.
+$atStartup = New-ScheduledTaskTrigger -AtStartup
+$repeat = New-ScheduledTaskTrigger -Once -At (Get-Date) `
+    -RepetitionInterval (New-TimeSpan -Minutes 5)
+try { $repeat.Repetition.Duration = 'P3650D' } catch {} # some builds need an explicit long duration
+$trigger = @($atStartup, $repeat)
$settings = New-ScheduledTaskSettingsSet `
-ExecutionTimeLimit (New-TimeSpan -Hours 0) `
-    -RestartCount 10 `
+    -RestartCount 99 `
-RestartInterval (New-TimeSpan -Minutes 1) `
-StartWhenAvailable `
-MultipleInstances IgnoreNew `


@@ -0,0 +1,46 @@
# Reconfigures the PW-DocserverWorker scheduled task for self-healing:
#   - restart up to 99x at 1-min intervals if the task action fails
#   - StartWhenAvailable (catch up if a trigger was missed)
#   - a repeating safety trigger every 5 min with MultipleInstances=IgnoreNew,
#     so if the worker process ever dies (crash, manual kill, missed boot
#     trigger) it relaunches within ~5 min instead of waiting for a reboot
#   - keeps AtStartup + SYSTEM/Highest (current working config)
# Idempotent: safe to re-run. Run as Administrator.
$ErrorActionPreference = 'Stop'
$taskName = 'PW-DocserverWorker'
$appDir = 'C:\docserver'
$action = New-ScheduledTaskAction -Execute 'cmd.exe' `
    -Argument "/c `"$appDir\start_worker.bat`"" -WorkingDirectory $appDir

# Two triggers: at boot, and a repeating safety net every 5 minutes (indefinitely).
$atStartup = New-ScheduledTaskTrigger -AtStartup
$repeat = New-ScheduledTaskTrigger -Once -At (Get-Date) `
    -RepetitionInterval (New-TimeSpan -Minutes 5)
# Some Windows builds cap repetition without an explicit long duration; set ~10y.
try { $repeat.Repetition.Duration = 'P3650D' } catch {}

$settings = New-ScheduledTaskSettingsSet `
    -ExecutionTimeLimit (New-TimeSpan -Hours 0) `
    -RestartCount 99 `
    -RestartInterval (New-TimeSpan -Minutes 1) `
    -StartWhenAvailable `
    -MultipleInstances IgnoreNew `
    -AllowStartIfOnBatteries `
    -DontStopIfGoingOnBatteries

$principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' `
    -LogonType ServiceAccount -RunLevel Highest

Register-ScheduledTask -TaskName $taskName -Action $action `
    -Trigger @($atStartup, $repeat) -Settings $settings -Principal $principal `
    -Description 'Performance West DOCX-to-PDF worker (MinIO + Word COM). Self-healing: restarts on failure + 5-min safety trigger.' `
    -Force | Out-Null
Write-Host "Reconfigured ${taskName}:"
$ti = Get-ScheduledTask -TaskName $taskName
$ti.Triggers | ForEach-Object { Write-Host (" trigger: " + $_.CimClass.CimClassName) }
$s = $ti.Settings
Write-Host (" RestartCount=" + $s.RestartCount + " RestartInterval=" + $s.RestartInterval +
" StartWhenAvailable=" + $s.StartWhenAvailable + " MultipleInstances=" + $s.MultipleInstances)
Write-Host (" State=" + $ti.State)


@@ -0,0 +1,17 @@
@echo off
setlocal enabledelayedexpansion
cd /d C:\docserver
echo [%date% %time%] Starting Performance West Docserver Worker...
for /f "usebackq tokens=1,* delims==" %%a in ("C:\docserver\docserver.env") do (
    set "ln=%%a"
    if not "!ln:~0,1!"=="#" (
        if not "%%a"=="" set "%%a=%%b"
    )
)
C:\Python313\python.exe C:\docserver\docserver_worker.py
set "rc=%errorlevel%"
echo [%date% %time%] Worker exited with code %rc%.
endlocal & exit /b %rc%
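
For readers less familiar with batch syntax: the `for /f` loop in the script above loads `KEY=VALUE` pairs from `docserver.env` into the environment before Python starts, skipping lines that begin with `#`. Roughly the same parsing can be sketched in Python. `parse_env_text` is a hypothetical helper, not part of the worker, and the variable names in the sample are made up. It only approximates `for /f`, which for example also skips lines starting with `;` by default.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Approximate the batch loader: KEY=VALUE lines, '#' marks a comment line."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        # Blank lines and comment lines carry no assignment.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        if key:
            env[key] = value
    return env


# Hypothetical file contents, for illustration only.
sample = (
    "# MinIO settings\n"
    "EXAMPLE_ENDPOINT=minio.example.internal:9000\n"
    "\n"
    "EXAMPLE_TOKEN=abc=def\n"
)
print(parse_env_text(sample))
```

Splitting on the first `=` mirrors `tokens=1,*`, where the second token is the remainder of the line.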