docserver: self-healing Task Scheduler config + docs

Companion to the worker MinIO-retry fix. Makes the worker auto-recover from
process death (crash, manual kill, missed boot trigger), not just MinIO outages.

- start_worker.bat: propagate Python's exit code (exit /b %rc%) so Task
  Scheduler can actually detect a failed run (it previously always exited 0).
- reconfigure_task.ps1 (new): re-registers PW-DocserverWorker with
  RestartCount=99 / 1-min interval, StartWhenAvailable, and two triggers:
  AtStartup plus a 5-min repeating trigger with MultipleInstances=IgnoreNew,
  so a dead worker relaunches within ~5 min and never double-runs. Idempotent.
- install.ps1: same self-healing settings for fresh installs.
- Verified on the box: killed the worker -> task relaunched it; firing again
  while running stayed at one instance.

Docs updated to match reality:

- docserver/README.md: new 'Reliability / self-healing' section.
- document-generation.md: corrected the stale 'Flask DocServer :5050 / HTTP'
  description to the actual MinIO outbound-only transport.
- e2e-test-plan.md: removed the outdated 'Word COM fails under SYSTEM /
  requires RDP after every reboot' limitation; now self-healing under SYSTEM
  session 0.
- infrastructure.md: fixed VM spec (Win Server 2019, Word 16.0, Python 3.13,
  SSH port 22422) + self-healing note.
- architecture.md / formation-system.md: trigger + self-healing details.
2026-06-15 22:49:21 -05:00
commit b48d0cb799
parent 7929413eeb
9 changed files with 150 additions and 24 deletions
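
Reviewer note: the capped exponential backoff described in the README hunk (5s → 120s) can be sketched in Python roughly as follows. This is an illustrative sketch, not the worker's actual code. The names `backoff_delays` and `connect_with_retry`, the doubling factor, and the catch-all exception handling are assumptions; only the 5 s starting delay and the 120 s cap come from this commit.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 5.0, cap: float = 120.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield delays that grow geometrically from `base` and never exceed `cap`.

    The factor of 2 is an assumption; the commit only states 5s -> 120s.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_retry(connect: Callable[[], T],
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect()` until it succeeds instead of exiting on failure.

    `connect` is a hypothetical zero-argument callable that builds the client
    and raises on failure. `sleep` is injectable so the loop can be tested
    without real waiting.
    """
    for delay in backoff_delays():
        try:
            return connect()
        except Exception as exc:  # assumption: every failure is transient
            print(f"connect failed ({exc!r}); retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable: backoff_delays() never ends")
```

With these assumed defaults the waits run 5, 10, 20, 40, 80, 120, 120, ... seconds, so a long MinIO outage costs at most one attempt every two minutes and the process never exits on its own.
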
--- a/docserver/README.md
+++ b/docserver/README.md
@@ -61,6 +61,32 @@ This will:
 The worker must run as a **logged-in user** — Word COM requires an interactive
 Windows session and will fail under a system service account.

+## Reliability / self-healing
+
+The worker is designed to recover from outages without manual intervention:
+
+- **MinIO outages don't kill it.** The worker retries the MinIO connection
+  indefinitely with capped exponential backoff (5s → 120s) instead of exiting,
+  and each poll cycle is wrapped so a transient network error / 502 just
+  rebuilds the client and keeps going. (Previously a single 502 made the worker
+  `sys.exit(1)`, leaving it dead until a reboot.)
+- **Crashes / kills are auto-recovered by Task Scheduler.** The
+  `PW-DocserverWorker` task has:
+  - `RestartCount=99`, `RestartInterval=1 min` — relaunch if the action fails,
+  - **two triggers**: `AtStartup` plus a **repeating trigger every 5 minutes**
+    with `MultipleInstances=IgnoreNew`, so if the process ever dies (crash,
+    manual kill, or a missed boot trigger) it relaunches within ~5 min and
+    never runs more than one instance,
+  - `StartWhenAvailable` to catch up a missed trigger.
+- `start_worker.bat` **propagates Python's exit code** (`exit /b %rc%`) so
+  Task Scheduler can actually detect a failed run.
+
+To re-apply these task settings on an existing install, run as Administrator:
+
+```powershell
+powershell -ExecutionPolicy Bypass -File C:\docserver\reconfigure_task.ps1
+```
+
 ## How to access MinIO externally

 The Windows VM needs to reach MinIO. Options: