docserver: self-healing Task Scheduler config + docs
Companion to the worker MinIO-retry fix. Makes the worker auto-recover from
process death (crash, manual kill, missed boot trigger), not just MinIO outages.

- start_worker.bat: propagate Python's exit code (exit /b %rc%) so Task
  Scheduler can actually detect a failed run (it previously always exited 0).
- reconfigure_task.ps1 (new): re-registers PW-DocserverWorker with
  RestartCount=99 / 1-min interval, StartWhenAvailable, and two triggers:
  AtStartup plus a 5-min repeating trigger with MultipleInstances=IgnoreNew,
  so a dead worker relaunches within ~5 min and never double-runs. Idempotent.
- install.ps1: same self-healing settings for fresh installs.
- Verified on the box: killed the worker -> task relaunched it; firing again
  while running stayed at one instance.

Docs updated to match reality:

- docserver/README.md: new 'Reliability / self-healing' section.
- document-generation.md: corrected the stale 'Flask DocServer :5050 / HTTP'
  description to the actual MinIO outbound-only transport.
- e2e-test-plan.md: removed the outdated 'Word COM fails under SYSTEM /
  requires RDP after every reboot' limitation; now self-healing under SYSTEM
  session 0.
- infrastructure.md: fixed VM spec (Win Server 2019, Word 16.0, Python 3.13,
  SSH port 22422) + self-healing note.
- architecture.md / formation-system.md: trigger + self-healing details.
parent 7929413eeb
commit b48d0cb799
9 changed files with 150 additions and 24 deletions
@@ -106,8 +106,8 @@ ERPNext custom Frappe apps (baked into `performancewest-erpnext:latest`):
 │ │ Office 365 Word (COM automation) │ │
 │ │ Python 3.13 + pywin32 + minio SDK │ │
 │ │ docserver_worker.py (MinIO poller, 12s interval) │ │
-│ │ Task Scheduler: PW-DocserverWorker (AtLogOn) │ │
-│ │ Auto-logon configured (requires RDP after cold reboot) │ │
+│ │ Task Scheduler: PW-DocserverWorker (AtStartup + 5-min) │ │
+│ │ Self-healing: restart-on-fail + MinIO-retry (no RDP) │ │
 │ │ Private network: 10.4.20.247 → MinIO via nginx │ │
 │ └──────────────────────────────────────────────────────────┘ │
 └─────────────────────────────────────────────────────────────────┘
@@ -284,6 +284,10 @@ Workers: upload DOCX to minio://performancewest/to-convert/{uuid}.docx

 - **Heartbeat:** DocServer writes `docserver-heartbeat.json` to MinIO every 60 seconds
 - **Fallback:** If heartbeat is stale (>5 min), workers auto-switch to LibreOffice headless
+- **Self-healing:** the worker retries MinIO with backoff instead of exiting on an
+  outage; the `PW-DocserverWorker` task restarts on failure (99×/1 min) and has a
+  5-min repeating trigger, so a crash/missed boot self-recovers without RDP. See
+  `docserver/README.md`.

 ## Boot Sequence

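The heartbeat and fallback bullets in the hunk above imply a staleness check on the Linux side. The following is a minimal Python sketch of that decision, not the repository's code. It assumes the caller already has a timestamp for the last heartbeat (for example the heartbeat object's last-modified time); `is_heartbeat_stale` and `choose_converter` are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from the docs: a heartbeat older than 5 minutes is stale.
STALE_AFTER = timedelta(minutes=5)


def is_heartbeat_stale(last_seen: Optional[datetime], now: datetime,
                       max_age: timedelta = STALE_AFTER) -> bool:
    """Return True if the heartbeat is missing or older than max_age."""
    if last_seen is None:  # no heartbeat object was found at all
        return True
    return (now - last_seen) > max_age


def choose_converter(last_seen: Optional[datetime], now: datetime) -> str:
    """Pick the conversion path: Word DocServer if alive, else LibreOffice."""
    return "libreoffice" if is_heartbeat_stale(last_seen, now) else "docserver"


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
print(choose_converter(now - timedelta(seconds=90), now))  # fresh heartbeat
print(choose_converter(now - timedelta(minutes=6), now))   # stale heartbeat
```

Passing `now` in explicitly keeps the check deterministic and easy to test; how the real converter obtains the heartbeat timestamp is not shown in this commit.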
@@ -194,20 +194,36 @@ DOCX to PDF conversion uses a two-tier approach:

 ### PRIMARY: Windows DocServer (Microsoft Word COM)

-A Windows server runs a Flask-based DocServer at `:5050` that uses Microsoft Word via COM
-automation for pixel-perfect DOCX → PDF conversion. This produces the highest-fidelity
-output (exact font rendering, correct page breaks, proper table formatting).
+A Windows server runs `docserver_worker.py` that uses Microsoft Word via COM
+automation for pixel-perfect DOCX → PDF conversion. This produces the highest-
+fidelity output (exact font rendering, correct page breaks, proper table
+formatting).
+
+The transport is **MinIO, not HTTP** — the Windows VM only makes **outbound**
+connections to MinIO, so there are no open inbound ports / SSH tunnels and it
+works behind any NAT:
+
+```text
+pdf_converter.py (Linux) MinIO (S3) docserver_worker.py (Windows)
+PUT docx → to-convert/{id}.docx ─────────► │
+│◄─ poll every 12s ───────┤
+│ ├─ Word.SaveAs → PDF
+GET pdf ← converted/{id}.pdf ◄──────────│◄─ PUT converted/{id}.pdf┘
+DEL docx / DEL pdf (cleanup)
+```

 ```python
-# pdf_converter.py — primary path
-response = requests.post(
-    f"http://{DOCSERVER_HOST}:5050/convert",
-    files={"file": open(docx_path, "rb")},
-    timeout=60,
-)
-pdf_bytes = response.content
+# pdf_converter.py — primary path (simplified)
+mc.put_object(bucket, f"to-convert/{job_id}.docx", docx_stream, length)
+# ...poll until converted/{job_id}.pdf appears (DOCSERVER_TIMEOUT, default 120s)...
+pdf_bytes = mc.get_object(bucket, f"converted/{job_id}.pdf").read()
 ```
+
+The Windows worker is **self-healing**: it retries MinIO with backoff instead of
+exiting on a transient outage, and its `PW-DocserverWorker` scheduled task
+restarts on failure plus re-fires every 5 minutes if the process dies. See
+`docserver/README.md` → "Reliability / self-healing".

 ### FALLBACK: LibreOffice Headless

 If DocServer is unavailable (network error, timeout, Windows server down), the converter
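The simplified primary path above elides the "poll until the PDF appears" step. A minimal sketch of such a polling loop follows. It assumes the existence check is supplied by the caller (in real code it would presumably wrap a MinIO lookup for `converted/{job_id}.pdf`); `wait_for_object` and the 2-second interval are hypothetical, while the 120-second default mirrors the `DOCSERVER_TIMEOUT` default mentioned in the diff.

```python
import time
from typing import Callable


def wait_for_object(exists: Callable[[], bool], timeout: float = 120.0,
                    interval: float = 2.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll exists() until it returns True or timeout seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if exists():
            return True
        if clock() >= deadline:
            return False  # caller would fall back to LibreOffice here
        sleep(interval)


# Demo with a fake store: the converted PDF "appears" on the third poll.
polls = {"n": 0}


def fake_exists() -> bool:
    polls["n"] += 1
    return polls["n"] >= 3


found = wait_for_object(fake_exists, timeout=5.0, interval=0.0)
print(found, polls["n"])
```

Using a monotonic clock for the deadline avoids surprises if the wall clock is adjusted while a conversion is pending.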
@@ -163,16 +163,25 @@ Write `scripts/tests/e2e_crtc_pipeline.py`:

 ### DocServer Investigation

-Word COM fails under SYSTEM account and "Run whether user is logged on or not" mode.
-Requires interactive desktop session (RDP login). Auto-logon configured (registry keys set)
-but blocked by hosting provider's Windows Server 2019 policy.
+As of 2026-06, the worker runs fine under the SYSTEM account in session 0 on
+this Windows Server 2019 box (Word COM initialises and converts normally), so the
+old "requires an interactive RDP login" workaround is no longer needed for normal
+operation. It is **self-healing**:

-**Workaround:** RDP into the VM once after reboot → AtLogOn trigger fires → Word COM works.
-LibreOffice fallback handles conversions automatically when DocServer is unavailable.
+- It retries the MinIO connection with backoff instead of `sys.exit(1)`, so a
+  transient MinIO 502 / outage no longer kills it (that was the cause of a
+  multi-week outage in May 2026).
+- The `PW-DocserverWorker` task restarts on failure (99×/1 min) and has a
+  5-minute repeating safety trigger (`MultipleInstances=IgnoreNew`), so a crash
+  or missed boot trigger self-recovers within ~5 min without a reboot/RDP.
+
+LibreOffice fallback still handles conversions automatically if DocServer is ever
+unavailable. See `docserver/README.md` → "Reliability / self-healing".

 ### Known Limitations

-1. **DocServer** — requires RDP login after cold reboot (auto-logon blocked by hosting provider)
+1. **DocServer** — self-healing (auto-restarts on MinIO outage/crash); RDP only
+   needed if Word COM itself breaks (DCOM misconfig → run `fix_dcom.bat`)
 2. **eSign JWT** — test uses different secret than dev API; falls back to PG simulation
 3. **Compliance Calendar** — DocType not imported to ERPNext; 417 error on query
 4. **ERPNext screenshots** — Playwright can't log into ERPNext from Docker (login page structure)
@@ -238,7 +238,7 @@ Flags for support conversations (escalation, priority, category).
 - Converts via Word COM, drops PDF in `converted/` bucket
 - Heartbeat file at `minio://performancewest/docserver-heartbeat.json` (60s interval)
 - Atomic uploads via `.tmp_` prefix + `copy_object` rename
-- Task Scheduler: `PW-DocserverWorker` — auto-restart on failure
+- Task Scheduler: `PW-DocserverWorker` — self-healing: restarts on failure (99×/1 min) + AtStartup and a 5-min repeating trigger (relaunches within ~5 min if the process dies). The worker also retries MinIO on outage instead of exiting.
 - **Fallback:** LibreOffice headless (`soffice --headless --convert-to pdf`) auto-activates when DocServer heartbeat stale (>5 min)
 - **E2E tested:** 36KB DOCX → 82KB PDF in 12 seconds total round-trip

@@ -18,14 +18,15 @@

 | Resource | Spec |
 |----------|------|
-| OS | Windows Server 2022 |
+| IP / SSH | 108.181.102.34 (OpenSSH for Windows, **port 22422**) |
+| OS | Windows Server 2019 (10.0.17763) |
 | vCPU | 2 |
 | RAM | 4 GB |
 | Disk | 40 GB SSD |
-| Software | Microsoft Office 2021 + Python 3.12 |
+| Software | Microsoft Word (Office 16.0) + Python 3.13 |
 | Service | docserver_worker.py (polls MinIO, converts via Word COM) |

-Pixel-perfect DOCX→PDF conversion via Microsoft Word. Worker polls MinIO `to-convert/` bucket, converts via Word COM, uploads PDF to `converted/`. No HTTP server needed — MinIO is the transport. Requires RDP login after reboot (Word COM needs interactive session). LibreOffice headless is the automatic fallback.
+Pixel-perfect DOCX→PDF conversion via Microsoft Word. Worker polls MinIO `to-convert/` bucket, converts via Word COM, uploads PDF to `converted/`. No HTTP server needed — MinIO is the transport (outbound-only; works behind any NAT). Runs under the SYSTEM account in session 0; **self-healing** — retries MinIO on outage instead of exiting, and the `PW-DocserverWorker` task restarts on failure + re-fires every 5 min if the process dies (no RDP needed for normal operation; RDP/`fix_dcom.bat` only if Word COM itself breaks). Heartbeat at `minio://performancewest/docserver-heartbeat.json` (60s). LibreOffice headless is the automatic fallback. Details: `docserver/README.md`.

 ## Email Servers

@@ -61,6 +61,32 @@ This will:
 The worker must run as a **logged-in user** — Word COM requires an interactive
 Windows session and will fail under a system service account.

+## Reliability / self-healing
+
+The worker is designed to recover from outages without manual intervention:
+
+- **MinIO outages don't kill it.** The worker retries the MinIO connection
+  indefinitely with capped exponential backoff (5s → 120s) instead of exiting,
+  and each poll cycle is wrapped so a transient network error / 502 just
+  rebuilds the client and keeps going. (Previously a single 502 made the worker
+  `sys.exit(1)`, leaving it dead until a reboot.)
+- **Crashes / kills are auto-recovered by Task Scheduler.** The
+  `PW-DocserverWorker` task has:
+  - `RestartCount=99`, `RestartInterval=1 min` — relaunch if the action fails,
+  - **two triggers**: `AtStartup` plus a **repeating trigger every 5 minutes**
+    with `MultipleInstances=IgnoreNew`, so if the process ever dies (crash,
+    manual kill, or a missed boot trigger) it relaunches within ~5 min and
+    never runs more than one instance,
+  - `StartWhenAvailable` to catch up a missed trigger.
+- `start_worker.bat` **propagates Python's exit code** (`exit /b %rc%`) so
+  Scheduler can actually detect a failed run.
+
+To re-apply these task settings on an existing install, run as Administrator:
+
+```powershell
+powershell -ExecutionPolicy Bypass -File C:\docserver\reconfigure_task.ps1
+```
+
 ## How to access MinIO externally

 The Windows VM needs to reach MinIO. Options:
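The README hunk above describes "capped exponential backoff (5s → 120s)" without showing it. The sketch below is one way that schedule could look in Python; it is not taken from `docserver_worker.py`. The doubling factor and the names `backoff_delays` and `connect_with_retry` are assumptions for illustration; only the 5-second start and 120-second cap come from the text.

```python
import itertools
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(initial: float = 5.0, cap: float = 120.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield delays that grow by `factor` and never exceed `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_retry(connect: Callable[[], T],
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, sleeping with capped backoff between tries."""
    for delay in backoff_delays():
        try:
            return connect()
        except Exception as exc:  # a real worker would catch narrower errors
            print(f"connect failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable: backoff_delays() is infinite")


# First seven delays of the schedule, in seconds.
print(list(itertools.islice(backoff_delays(), 7)))
```

Because the loop never gives up, a long MinIO outage leaves the process alive and waiting, which is the behaviour the README describes; the Task Scheduler settings then only need to cover the case where the process itself dies.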
@@ -166,11 +166,18 @@ $action = New-ScheduledTaskAction `
     -Argument "/c `"$AppDir\start_worker.bat`"" `
     -WorkingDirectory $AppDir

-$trigger = New-ScheduledTaskTrigger -AtStartup
+# Two triggers for self-healing: at boot, plus a repeating 5-minute safety net
+# that relaunches the worker if its process ever dies (crash, manual kill, or a
+# missed boot trigger). MultipleInstances=IgnoreNew keeps it to one instance.
+$atStartup = New-ScheduledTaskTrigger -AtStartup
+$repeat = New-ScheduledTaskTrigger -Once -At (Get-Date) `
+    -RepetitionInterval (New-TimeSpan -Minutes 5)
+try { $repeat.Repetition.Duration = 'P3650D' } catch {} # some builds need an explicit long duration
+$trigger = @($atStartup, $repeat)

 $settings = New-ScheduledTaskSettingsSet `
     -ExecutionTimeLimit (New-TimeSpan -Hours 0) `
-    -RestartCount 10 `
+    -RestartCount 99 `
     -RestartInterval (New-TimeSpan -Minutes 1) `
     -StartWhenAvailable `
     -MultipleInstances IgnoreNew `
46
docserver/reconfigure_task.ps1
Normal file
@@ -0,0 +1,46 @@
+# Reconfigures the PW-DocserverWorker scheduled task for self-healing:
+# - restart up to 99x at 1-min intervals if the task action fails
+# - StartWhenAvailable (catch up if a trigger was missed)
+# - a repeating safety trigger every 5 min with MultipleInstances=IgnoreNew,
+#   so if the worker process ever dies (crash, manual kill, missed boot
+#   trigger) it relaunches within ~5 min instead of waiting for a reboot
+# - keeps AtStartup + SYSTEM/Highest (current working config)
+# Idempotent: safe to re-run. Run as Administrator.
+$ErrorActionPreference = 'Stop'
+$taskName = 'PW-DocserverWorker'
+$appDir = 'C:\docserver'
+
+$action = New-ScheduledTaskAction -Execute 'cmd.exe' `
+    -Argument "/c `"$appDir\start_worker.bat`"" -WorkingDirectory $appDir
+
+# Two triggers: at boot, and a repeating safety net every 5 minutes (indefinitely).
+$atStartup = New-ScheduledTaskTrigger -AtStartup
+$repeat = New-ScheduledTaskTrigger -Once -At (Get-Date) `
+    -RepetitionInterval (New-TimeSpan -Minutes 5)
+# Some Windows builds cap repetition without an explicit long duration; set ~10y.
+try { $repeat.Repetition.Duration = 'P3650D' } catch {}
+
+$settings = New-ScheduledTaskSettingsSet `
+    -ExecutionTimeLimit (New-TimeSpan -Hours 0) `
+    -RestartCount 99 `
+    -RestartInterval (New-TimeSpan -Minutes 1) `
+    -StartWhenAvailable `
+    -MultipleInstances IgnoreNew `
+    -AllowStartIfOnBatteries `
+    -DontStopIfGoingOnBatteries
+
+$principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' `
+    -LogonType ServiceAccount -RunLevel Highest
+
+Register-ScheduledTask -TaskName $taskName -Action $action `
+    -Trigger @($atStartup, $repeat) -Settings $settings -Principal $principal `
+    -Description 'Performance West DOCX-to-PDF worker (MinIO + Word COM). Self-healing: restarts on failure + 5-min safety trigger.' `
+    -Force | Out-Null
+
+Write-Host "Reconfigured ${taskName}:"
+$ti = Get-ScheduledTask -TaskName $taskName
+$ti.Triggers | ForEach-Object { Write-Host (" trigger: " + $_.CimClass.CimClassName) }
+$s = $ti.Settings
+Write-Host (" RestartCount=" + $s.RestartCount + " RestartInterval=" + $s.RestartInterval +
+    " StartWhenAvailable=" + $s.StartWhenAvailable + " MultipleInstances=" + $s.MultipleInstances)
+Write-Host (" State=" + $ti.State)
17
docserver/start_worker.bat
Normal file
@@ -0,0 +1,17 @@
+@echo off
+setlocal enabledelayedexpansion
+cd /d C:\docserver
+
+echo [%date% %time%] Starting Performance West Docserver Worker...
+
+for /f "usebackq tokens=1,* delims==" %%a in ("C:\docserver\docserver.env") do (
+    set "ln=%%a"
+    if not "!ln:~0,1!"=="#" (
+        if not "%%a"=="" set "%%a=%%b"
+    )
+)
+
+C:\Python313\python.exe C:\docserver\docserver_worker.py
+set "rc=%errorlevel%"
+echo [%date% %time%] Worker exited with code %rc%.
+endlocal & exit /b %rc%
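The `for /f` loop in `start_worker.bat` above is terse, so here is a rough Python equivalent of what it appears to do with `docserver.env`: skip `#` comment lines, split each remaining line on the first `=`, and treat the left side as the variable name. This is an approximation for reading purposes only. Edge cases differ (for example, `for /f` also skips lines starting with `;` by default and collapses repeated `=` delimiters), and `parse_env_lines` and the sample keys are hypothetical.

```python
from typing import Dict, Iterable


def parse_env_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")  # split on the first '=' only
        if key and sep:
            env[key] = value
    return env


# Hypothetical file contents; the real docserver.env is not part of this diff.
sample = [
    "# MinIO settings",
    "MINIO_ENDPOINT=minio.example.internal:9000",
    "",
    "MINIO_SECRET=abc=def",
]
print(parse_env_lines(sample))
```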