docserver: self-healing Task Scheduler config + docs

Companion to the worker MinIO-retry fix. Makes the worker auto-recover from
process death (crash, manual kill, missed boot trigger), not just MinIO outages.

- start_worker.bat: propagate Python's exit code (exit /b %rc%) so Task
  Scheduler can actually detect a failed run (it previously always exited 0).
- reconfigure_task.ps1 (new): re-registers PW-DocserverWorker with
  RestartCount=99 / 1-min interval, StartWhenAvailable, and two triggers —
  AtStartup plus a 5-min repeating trigger with MultipleInstances=IgnoreNew, so
  a dead worker relaunches within ~5 min and never double-runs. Idempotent.
- install.ps1: same self-healing settings for fresh installs.
- Verified on the box: killed the worker -> task relaunched it; firing again
  while running stayed at one instance.

Docs updated to match reality:
- docserver/README.md: new 'Reliability / self-healing' section.
- document-generation.md: corrected the stale 'Flask DocServer :5050 / HTTP'
  description to the actual MinIO outbound-only transport.
- e2e-test-plan.md: removed the outdated 'Word COM fails under SYSTEM / requires
  RDP after every reboot' limitation; now self-healing under SYSTEM session 0.
- infrastructure.md: fixed VM spec (Win Server 2019, Word 16.0, Python 3.13,
  SSH port 22422) + self-healing note.
- architecture.md / formation-system.md: trigger + self-healing details.
justin 2026-06-15 22:49:21 -05:00
parent 7929413eeb
commit b48d0cb799
9 changed files with 150 additions and 24 deletions

@ -106,8 +106,8 @@ ERPNext custom Frappe apps (baked into `performancewest-erpnext:latest`):
 │ │ Office 365 Word (COM automation) │ │
 │ │ Python 3.13 + pywin32 + minio SDK │ │
 │ │ docserver_worker.py (MinIO poller, 12s interval) │ │
-│ │ Task Scheduler: PW-DocserverWorker (AtLogOn) │ │
-│ │ Auto-logon configured (requires RDP after cold reboot) │ │
+│ │ Task Scheduler: PW-DocserverWorker (AtStartup + 5-min) │ │
+│ │ Self-healing: restart-on-fail + MinIO-retry (no RDP) │ │
 │ │ Private network: 10.4.20.247 → MinIO via nginx │ │
 │ └──────────────────────────────────────────────────────────┘ │
 └─────────────────────────────────────────────────────────────────┘
@ -284,6 +284,10 @@ Workers: upload DOCX to minio://performancewest/to-convert/{uuid}.docx
 - **Heartbeat:** DocServer writes `docserver-heartbeat.json` to MinIO every 60 seconds
 - **Fallback:** If heartbeat is stale (>5 min), workers auto-switch to LibreOffice headless
+- **Self-healing:** the worker retries MinIO with backoff instead of exiting on an
+  outage; the `PW-DocserverWorker` task restarts on failure (99×/1 min) and has a
+  5-min repeating trigger, so a crash/missed boot self-recovers without RDP. See
+  `docserver/README.md`.
 ## Boot Sequence

@ -194,20 +194,36 @@ DOCX to PDF conversion uses a two-tier approach:
 ### PRIMARY: Windows DocServer (Microsoft Word COM)
-A Windows server runs a Flask-based DocServer at `:5050` that uses Microsoft Word via COM
-automation for pixel-perfect DOCX → PDF conversion. This produces the highest-fidelity
-output (exact font rendering, correct page breaks, proper table formatting).
+A Windows server runs `docserver_worker.py` that uses Microsoft Word via COM
+automation for pixel-perfect DOCX → PDF conversion. This produces the
+highest-fidelity output (exact font rendering, correct page breaks, proper table
+formatting).
+The transport is **MinIO, not HTTP** — the Windows VM only makes **outbound**
+connections to MinIO, so there are no open inbound ports / SSH tunnels and it
+works behind any NAT:
+```text
+pdf_converter.py (Linux)                MinIO (S3)          docserver_worker.py (Windows)
+  PUT docx → to-convert/{id}.docx ─────────► │
+                                             │◄─ poll every 12s ───────┤
+                                             │                         ├─ Word.SaveAs → PDF
+  GET pdf ← converted/{id}.pdf    ◄──────────│◄─ PUT converted/{id}.pdf┘
+  DEL docx / DEL pdf (cleanup)
+```
 ```python
-# pdf_converter.py — primary path
-response = requests.post(
-    f"http://{DOCSERVER_HOST}:5050/convert",
-    files={"file": open(docx_path, "rb")},
-    timeout=60,
-)
-pdf_bytes = response.content
+# pdf_converter.py — primary path (simplified)
+mc.put_object(bucket, f"to-convert/{job_id}.docx", docx_stream, length)
+# ...poll until converted/{job_id}.pdf appears (DOCSERVER_TIMEOUT, default 120s)...
+pdf_bytes = mc.get_object(bucket, f"converted/{job_id}.pdf").read()
 ```
+The Windows worker is **self-healing**: it retries MinIO with backoff instead of
+exiting on a transient outage, and its `PW-DocserverWorker` scheduled task
+restarts on failure plus re-fires every 5 minutes if the process dies. See
+`docserver/README.md` → "Reliability / self-healing".
 ### FALLBACK: LibreOffice Headless
 If DocServer is unavailable (network error, timeout, Windows server down), the converter
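
The "poll until `converted/{job_id}.pdf` appears" step in the hunk above is elided in the diff. The following is a minimal sketch of what such a loop could look like, not the code in `pdf_converter.py`. `InMemoryStore`, `convert_via_docserver`, and `poll_interval` are hypothetical names introduced here; the store is a stand-in for the MinIO bucket so the example runs without a server, and deleting both objects on the client side is an assumption about who performs the cleanup.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for the MinIO bucket (not the minio SDK)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        # Returns None while the object does not exist yet.
        return self.objects.get(key)

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self.objects.pop(key, None)


def convert_via_docserver(store, job_id, docx_bytes, timeout=120.0,
                          poll_interval=2.0, sleep=time.sleep,
                          clock=time.monotonic):
    """Upload a DOCX, wait for the matching PDF, then remove both objects.

    Returns the PDF bytes, or None on timeout so the caller can fall back
    to another converter.
    """
    src_key = f"to-convert/{job_id}.docx"
    dst_key = f"converted/{job_id}.pdf"
    store.put(src_key, docx_bytes)
    deadline = clock() + timeout
    try:
        while clock() < deadline:
            pdf = store.get(dst_key)
            if pdf is not None:
                return pdf
            sleep(poll_interval)
        return None
    finally:
        # Assumed cleanup step ("DEL docx / DEL pdf" in the diagram).
        store.delete(src_key)
        store.delete(dst_key)
```

Returning `None` on timeout rather than raising keeps the fallback decision with the caller, which matches the two-tier design described in this file.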

@ -163,16 +163,25 @@ Write `scripts/tests/e2e_crtc_pipeline.py`:
 ### DocServer Investigation
-Word COM fails under SYSTEM account and "Run whether user is logged on or not" mode.
-Requires interactive desktop session (RDP login). Auto-logon configured (registry keys set)
-but blocked by hosting provider's Windows Server 2019 policy.
-**Workaround:** RDP into the VM once after reboot → AtLogOn trigger fires → Word COM works.
-LibreOffice fallback handles conversions automatically when DocServer is unavailable.
+As of 2026-06, the worker runs fine under the SYSTEM account in session 0 on
+this Windows Server 2019 box (Word COM initialises and converts normally), so the
+old "requires an interactive RDP login" workaround is no longer needed for normal
+operation. It is **self-healing**:
+- It retries the MinIO connection with backoff instead of `sys.exit(1)`, so a
+  transient MinIO 502 / outage no longer kills it (that was the cause of a
+  multi-week outage in May 2026).
+- The `PW-DocserverWorker` task restarts on failure (99×/1 min) and has a
+  5-minute repeating safety trigger (`MultipleInstances=IgnoreNew`), so a crash
+  or missed boot trigger self-recovers within ~5 min without a reboot/RDP.
+LibreOffice fallback still handles conversions automatically if DocServer is ever
+unavailable. See `docserver/README.md` → "Reliability / self-healing".
 ### Known Limitations
-1. **DocServer** — requires RDP login after cold reboot (auto-logon blocked by hosting provider)
+1. **DocServer** — self-healing (auto-restarts on MinIO outage/crash); RDP only
+   needed if Word COM itself breaks (DCOM misconfig → run `fix_dcom.bat`)
 2. **eSign JWT** — test uses different secret than dev API; falls back to PG simulation
 3. **Compliance Calendar** — DocType not imported to ERPNext; 417 error on query
 4. **ERPNext screenshots** — Playwright can't log into ERPNext from Docker (login page structure)

@ -238,7 +238,7 @@ Flags for support conversations (escalation, priority, category).
 - Converts via Word COM, drops PDF in `converted/` bucket
 - Heartbeat file at `minio://performancewest/docserver-heartbeat.json` (60s interval)
 - Atomic uploads via `.tmp_` prefix + `copy_object` rename
-- Task Scheduler: `PW-DocserverWorker`, auto-restart on failure
+- Task Scheduler: `PW-DocserverWorker`, self-healing: restarts on failure (99×/1 min) + AtStartup and a 5-min repeating trigger (relaunches within ~5 min if the process dies). The worker also retries MinIO on outage instead of exiting.
 - **Fallback:** LibreOffice headless (`soffice --headless --convert-to pdf`) auto-activates when DocServer heartbeat stale (>5 min)
 - **E2E tested:** 36KB DOCX → 82KB PDF in 12 seconds total round-trip

@ -18,14 +18,15 @@
 | Resource | Spec |
 |----------|------|
-| OS | Windows Server 2022 |
+| IP / SSH | 108.181.102.34 (OpenSSH for Windows, **port 22422**) |
+| OS | Windows Server 2019 (10.0.17763) |
 | vCPU | 2 |
 | RAM | 4 GB |
 | Disk | 40 GB SSD |
-| Software | Microsoft Office 2021 + Python 3.12 |
+| Software | Microsoft Word (Office 16.0) + Python 3.13 |
 | Service | docserver_worker.py (polls MinIO, converts via Word COM) |
-Pixel-perfect DOCX→PDF conversion via Microsoft Word. Worker polls MinIO `to-convert/` bucket, converts via Word COM, uploads PDF to `converted/`. No HTTP server needed — MinIO is the transport. Requires RDP login after reboot (Word COM needs interactive session). LibreOffice headless is the automatic fallback.
+Pixel-perfect DOCX→PDF conversion via Microsoft Word. Worker polls MinIO `to-convert/` bucket, converts via Word COM, uploads PDF to `converted/`. No HTTP server needed — MinIO is the transport (outbound-only; works behind any NAT). Runs under the SYSTEM account in session 0; **self-healing** — retries MinIO on outage instead of exiting, and the `PW-DocserverWorker` task restarts on failure + re-fires every 5 min if the process dies (no RDP needed for normal operation; RDP/`fix_dcom.bat` only if Word COM itself breaks). Heartbeat at `minio://performancewest/docserver-heartbeat.json` (60s). LibreOffice headless is the automatic fallback. Details: `docserver/README.md`.
 ## Email Servers

@ -61,6 +61,32 @@ This will:
 The worker must run as a **logged-in user** — Word COM requires an interactive
 Windows session and will fail under a system service account.
+## Reliability / self-healing
+The worker is designed to recover from outages without manual intervention:
+- **MinIO outages don't kill it.** The worker retries the MinIO connection
+  indefinitely with capped exponential backoff (5s → 120s) instead of exiting,
+  and each poll cycle is wrapped so a transient network error / 502 just
+  rebuilds the client and keeps going. (Previously a single 502 made the worker
+  `sys.exit(1)`, leaving it dead until a reboot.)
+- **Crashes / kills are auto-recovered by Task Scheduler.** The
+  `PW-DocserverWorker` task has:
+  - `RestartCount=99`, `RestartInterval=1 min` — relaunch if the action fails,
+  - **two triggers**: `AtStartup` plus a **repeating trigger every 5 minutes**
+    with `MultipleInstances=IgnoreNew`, so if the process ever dies (crash,
+    manual kill, or a missed boot trigger) it relaunches within ~5 min and
+    never runs more than one instance,
+  - `StartWhenAvailable` to catch up a missed trigger.
+- `start_worker.bat` **propagates Python's exit code** (`exit /b %rc%`) so
+  Task Scheduler can actually detect a failed run.
+To re-apply these task settings on an existing install, run as Administrator:
+```powershell
+powershell -ExecutionPolicy Bypass -File C:\docserver\reconfigure_task.ps1
+```
 ## How to access MinIO externally
 The Windows VM needs to reach MinIO. Options:
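
The "capped exponential backoff (5s → 120s)" described in the Reliability section of this README hunk can be sketched as follows. This is a sketch under stated assumptions, not the code in `docserver_worker.py`: the doubling factor is assumed (the README gives only the 5s start and 120s cap), and `backoff_delays` and `connect_with_retry` are hypothetical names.

```python
import time


def backoff_delays(start=5.0, cap=120.0, factor=2.0):
    """Yield retry delays forever: start, start*factor, ... never above cap."""
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_retry(connect, sleep=time.sleep, delays=None):
    """Call connect() until it succeeds, sleeping between failed attempts.

    With the default infinite delay generator this never gives up, which is
    the behaviour the README describes (retry instead of exiting).
    """
    delays = backoff_delays() if delays is None else delays
    for delay in delays:
        try:
            return connect()
        except Exception as exc:  # broad on purpose: any failure means "retry"
            print(f"connection failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

Capping the delay bounds how long the worker stays idle after the outage ends: in this sketch it would reconnect at most 120 seconds after MinIO comes back.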

@ -166,11 +166,18 @@ $action = New-ScheduledTaskAction `
     -Argument "/c `"$AppDir\start_worker.bat`"" `
     -WorkingDirectory $AppDir
-$trigger = New-ScheduledTaskTrigger -AtStartup
+# Two triggers for self-healing: at boot, plus a repeating 5-minute safety net
+# that relaunches the worker if its process ever dies (crash, manual kill, or a
+# missed boot trigger). MultipleInstances=IgnoreNew keeps it to one instance.
+$atStartup = New-ScheduledTaskTrigger -AtStartup
+$repeat = New-ScheduledTaskTrigger -Once -At (Get-Date) `
+    -RepetitionInterval (New-TimeSpan -Minutes 5)
+try { $repeat.Repetition.Duration = 'P3650D' } catch {} # some builds need an explicit long duration
+$trigger = @($atStartup, $repeat)
 $settings = New-ScheduledTaskSettingsSet `
     -ExecutionTimeLimit (New-TimeSpan -Hours 0) `
-    -RestartCount 10 `
+    -RestartCount 99 `
     -RestartInterval (New-TimeSpan -Minutes 1) `
     -StartWhenAvailable `
     -MultipleInstances IgnoreNew `

@ -0,0 +1,46 @@
# Reconfigures the PW-DocserverWorker scheduled task for self-healing:
# - restart up to 99x at 1-min intervals if the task action fails
# - StartWhenAvailable (catch up if a trigger was missed)
# - a repeating safety trigger every 5 min with MultipleInstances=IgnoreNew,
# so if the worker process ever dies (crash, manual kill, missed boot
# trigger) it relaunches within ~5 min instead of waiting for a reboot
# - keeps AtStartup + SYSTEM/Highest (current working config)
# Idempotent: safe to re-run. Run as Administrator.
$ErrorActionPreference = 'Stop'
$taskName = 'PW-DocserverWorker'
$appDir = 'C:\docserver'
$action = New-ScheduledTaskAction -Execute 'cmd.exe' `
    -Argument "/c `"$appDir\start_worker.bat`"" -WorkingDirectory $appDir
# Two triggers: at boot, and a repeating safety net every 5 minutes (indefinitely).
$atStartup = New-ScheduledTaskTrigger -AtStartup
$repeat = New-ScheduledTaskTrigger -Once -At (Get-Date) `
    -RepetitionInterval (New-TimeSpan -Minutes 5)
# Some Windows builds cap repetition without an explicit long duration; set ~10y.
try { $repeat.Repetition.Duration = 'P3650D' } catch {}
$settings = New-ScheduledTaskSettingsSet `
    -ExecutionTimeLimit (New-TimeSpan -Hours 0) `
    -RestartCount 99 `
    -RestartInterval (New-TimeSpan -Minutes 1) `
    -StartWhenAvailable `
    -MultipleInstances IgnoreNew `
    -AllowStartIfOnBatteries `
    -DontStopIfGoingOnBatteries
$principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' `
    -LogonType ServiceAccount -RunLevel Highest
Register-ScheduledTask -TaskName $taskName -Action $action `
    -Trigger @($atStartup, $repeat) -Settings $settings -Principal $principal `
    -Description 'Performance West DOCX-to-PDF worker (MinIO + Word COM). Self-healing: restarts on failure + 5-min safety trigger.' `
    -Force | Out-Null
Write-Host "Reconfigured ${taskName}:"
$ti = Get-ScheduledTask -TaskName $taskName
$ti.Triggers | ForEach-Object { Write-Host (" trigger: " + $_.CimClass.CimClassName) }
$s = $ti.Settings
Write-Host (" RestartCount=" + $s.RestartCount + " RestartInterval=" + $s.RestartInterval +
" StartWhenAvailable=" + $s.StartWhenAvailable + " MultipleInstances=" + $s.MultipleInstances)
Write-Host (" State=" + $ti.State)

@ -0,0 +1,17 @@
@echo off
setlocal enabledelayedexpansion
cd /d C:\docserver
echo [%date% %time%] Starting Performance West Docserver Worker...
for /f "usebackq tokens=1,* delims==" %%a in ("C:\docserver\docserver.env") do (
    set "ln=%%a"
    if not "!ln:~0,1!"=="#" (
        if not "%%a"=="" set "%%a=%%b"
    )
)
C:\Python313\python.exe C:\docserver\docserver_worker.py
set "rc=%errorlevel%"
echo [%date% %time%] Worker exited with code %rc%.
endlocal & exit /b %rc%
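
For readers less familiar with batch syntax, the two things `start_worker.bat` does (load `KEY=VALUE` pairs from `docserver.env`, then return the worker's exit code to the caller) can be approximated in Python. This is an illustrative sketch only, not part of the commit: `parse_env_lines` and `run_and_propagate` are hypothetical names, and the parsing only approximates `for /f` (for example, it skips lines without `=` rather than reproducing batch's handling of them).

```python
import os
import subprocess
import sys


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blanks and lines that start with '#'.

    The value is everything after the first '=', like tokens=1,* delims==.
    """
    env = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep or not key:
            continue  # simplification: ignore lines with no key or no '='
        env[key] = value
    return env


def run_and_propagate(cmd, extra_env):
    """Run cmd with extra_env merged into the environment; return its exit code."""
    completed = subprocess.run(cmd, env={**os.environ, **extra_env})
    return completed.returncode


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python launcher.py docserver.env python worker.py
    with open(sys.argv[1], encoding="utf-8") as fh:
        settings = parse_env_lines(fh)
    sys.exit(run_and_propagate(sys.argv[2:], settings))
```

The point carried over from the batch file is the last line: the launcher exits with the child's code instead of always exiting 0, which is what lets a supervisor such as Task Scheduler tell a failed run from a clean one.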