diff --git a/docs/architecture.md b/docs/architecture.md
index af5724c..549435f 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -106,8 +106,8 @@ ERPNext custom Frappe apps (baked into `performancewest-erpnext:latest`):
 │  │  Office 365 Word (COM automation)                        │  │
 │  │  Python 3.13 + pywin32 + minio SDK                       │  │
 │  │  docserver_worker.py (MinIO poller, 12s interval)        │  │
-│  │  Task Scheduler: PW-DocserverWorker (AtLogOn)            │  │
-│  │  Auto-logon configured (requires RDP after cold reboot)  │  │
+│  │  Task Scheduler: PW-DocserverWorker (AtStartup + 5-min)  │  │
+│  │  Self-healing: restart-on-fail + MinIO-retry (no RDP)    │  │
 │  │  Private network: 10.4.20.247 → MinIO via nginx          │  │
 │  └──────────────────────────────────────────────────────────┘  │
 └─────────────────────────────────────────────────────────────────┘
@@ -284,6 +284,10 @@ Workers: upload DOCX to minio://performancewest/to-convert/{uuid}.docx
 
 - **Heartbeat:** DocServer writes `docserver-heartbeat.json` to MinIO every 60 seconds
 - **Fallback:** If heartbeat is stale (>5 min), workers auto-switch to LibreOffice headless
+- **Self-healing:** the worker retries MinIO with backoff instead of exiting on an
+  outage; the `PW-DocserverWorker` task restarts on failure (99×/1 min) and has a
+  5-min repeating trigger, so a crash/missed boot self-recovers without RDP. See
+  `docserver/README.md`.
 
 ## Boot Sequence
 
diff --git a/docs/document-generation.md b/docs/document-generation.md
index af2ba5f..805cfdd 100644
--- a/docs/document-generation.md
+++ b/docs/document-generation.md
@@ -194,20 +194,36 @@ DOCX to PDF conversion uses a two-tier approach:
 
 ### PRIMARY: Windows DocServer (Microsoft Word COM)
 
-A Windows server runs a Flask-based DocServer at `:5050` that uses Microsoft Word via COM
-automation for pixel-perfect DOCX → PDF conversion. This produces the highest-fidelity
-output (exact font rendering, correct page breaks, proper table formatting).
+A Windows server runs `docserver_worker.py` that uses Microsoft Word via COM
+automation for pixel-perfect DOCX → PDF conversion. This produces the highest-
+fidelity output (exact font rendering, correct page breaks, proper table
+formatting).
+
+The transport is **MinIO, not HTTP** — the Windows VM only makes **outbound**
+connections to MinIO, so there are no open inbound ports / SSH tunnels and it
+works behind any NAT:
+
+```text
+pdf_converter.py (Linux)                MinIO (S3)          docserver_worker.py (Windows)
+  PUT docx → to-convert/{id}.docx ─────────► │
+                                             │◄─ poll every 12s ───────┤
+                                             │                         ├─ Word.SaveAs → PDF
+  GET pdf  ← converted/{id}.pdf   ◄──────────│◄─ PUT converted/{id}.pdf┘
+  DEL docx / DEL pdf (cleanup)
+```
 
 ```python
-# pdf_converter.py — primary path
-response = requests.post(
-    f"http://{DOCSERVER_HOST}:5050/convert",
-    files={"file": open(docx_path, "rb")},
-    timeout=60,
-)
-pdf_bytes = response.content
+# pdf_converter.py — primary path (simplified)
+mc.put_object(bucket, f"to-convert/{job_id}.docx", docx_stream, length)
+# ...poll until converted/{job_id}.pdf appears (DOCSERVER_TIMEOUT, default 120s)...
+pdf_bytes = mc.get_object(bucket, f"converted/{job_id}.pdf").read()
 ```
 
+The Windows worker is **self-healing**: it retries MinIO with backoff instead of
+exiting on a transient outage, and its `PW-DocserverWorker` scheduled task
+restarts on failure plus re-fires every 5 minutes if the process dies. See
+`docserver/README.md` → "Reliability / self-healing".
+
 ### FALLBACK: LibreOffice Headless
 
 If DocServer is unavailable (network error, timeout, Windows server down), the converter
diff --git a/docs/e2e-test-plan.md b/docs/e2e-test-plan.md
index 5e331cc..1868ce8 100644
--- a/docs/e2e-test-plan.md
+++ b/docs/e2e-test-plan.md
@@ -163,16 +163,25 @@ Write `scripts/tests/e2e_crtc_pipeline.py`:
 
 ### DocServer Investigation
 
-Word COM fails under SYSTEM account and "Run whether user is logged on or not" mode.
-Requires interactive desktop session (RDP login). Auto-logon configured (registry keys set)
-but blocked by hosting provider's Windows Server 2019 policy.
+As of 2026-06, the worker runs fine under the SYSTEM account in session 0 on
+this Windows Server 2019 box (Word COM initialises and converts normally), so the
+old "requires an interactive RDP login" workaround is no longer needed for normal
+operation. It is **self-healing**:
 
-**Workaround:** RDP into the VM once after reboot → AtLogOn trigger fires → Word COM works.
-LibreOffice fallback handles conversions automatically when DocServer is unavailable.
+- It retries the MinIO connection with backoff instead of `sys.exit(1)`, so a
+  transient MinIO 502 / outage no longer kills it (that was the cause of a
+  multi-week outage in May 2026).
+- The `PW-DocserverWorker` task restarts on failure (99×/1 min) and has a
+  5-minute repeating safety trigger (`MultipleInstances=IgnoreNew`), so a crash
+  or missed boot trigger self-recovers within ~5 min without a reboot/RDP.
+
+LibreOffice fallback still handles conversions automatically if DocServer is ever
+unavailable. See `docserver/README.md` → "Reliability / self-healing".
 
 ### Known Limitations
 
-1. **DocServer** — requires RDP login after cold reboot (auto-logon blocked by hosting provider)
+1. **DocServer** — self-healing (auto-restarts on MinIO outage/crash); RDP only
+   needed if Word COM itself breaks (DCOM misconfig → run `fix_dcom.bat`)
 2. **eSign JWT** — test uses different secret than dev API; falls back to PG simulation
 3. **Compliance Calendar** — DocType not imported to ERPNext; 417 error on query
 4. **ERPNext screenshots** — Playwright can't log into ERPNext from Docker (login page structure)
diff --git a/docs/formation-system.md b/docs/formation-system.md
index fcf6116..d3f71c6 100644
--- a/docs/formation-system.md
+++ b/docs/formation-system.md
@@ -238,7 +238,7 @@ Flags for support conversations (escalation, priority, category).
   - Converts via Word COM, drops PDF in `converted/` bucket
   - Heartbeat file at `minio://performancewest/docserver-heartbeat.json` (60s interval)
   - Atomic uploads via `.tmp_` prefix + `copy_object` rename
-  - Task Scheduler: `PW-DocserverWorker` — auto-restart on failure
+  - Task Scheduler: `PW-DocserverWorker` — self-healing: restarts on failure (99×/1 min) + AtStartup and a 5-min repeating trigger (relaunches within ~5 min if the process dies). The worker also retries MinIO on outage instead of exiting.
 - **Fallback:** LibreOffice headless (`soffice --headless --convert-to pdf`) auto-activates when DocServer heartbeat stale (>5 min)
 - **E2E tested:** 36KB DOCX → 82KB PDF in 12 seconds total round-trip
 
diff --git a/docs/infrastructure.md b/docs/infrastructure.md
index 9da43f2..29d670e 100644
--- a/docs/infrastructure.md
+++ b/docs/infrastructure.md
@@ -18,14 +18,15 @@
 
 | Resource | Spec |
 |----------|------|
-| OS | Windows Server 2022 |
+| IP / SSH | 108.181.102.34 (OpenSSH for Windows, **port 22422**) |
+| OS | Windows Server 2019 (10.0.17763) |
 | vCPU | 2 |
 | RAM | 4 GB |
 | Disk | 40 GB SSD |
-| Software | Microsoft Office 2021 + Python 3.12 |
+| Software | Microsoft Word (Office 16.0) + Python 3.13 |
 | Service | docserver_worker.py (polls MinIO, converts via Word COM) |
 
-Pixel-perfect DOCX→PDF conversion via Microsoft Word. Worker polls MinIO `to-convert/` bucket, converts via Word COM, uploads PDF to `converted/`. No HTTP server needed — MinIO is the transport. Requires RDP login after reboot (Word COM needs interactive session). LibreOffice headless is the automatic fallback.
+Pixel-perfect DOCX→PDF conversion via Microsoft Word. Worker polls MinIO `to-convert/` bucket, converts via Word COM, uploads PDF to `converted/`. No HTTP server needed — MinIO is the transport (outbound-only; works behind any NAT). Runs under the SYSTEM account in session 0; **self-healing** — retries MinIO on outage instead of exiting, and the `PW-DocserverWorker` task restarts on failure + re-fires every 5 min if the process dies (no RDP needed for normal operation; RDP/`fix_dcom.bat` only if Word COM itself breaks). Heartbeat at `minio://performancewest/docserver-heartbeat.json` (60s). LibreOffice headless is the automatic fallback. Details: `docserver/README.md`.
 
 ## Email Servers
 
diff --git a/docserver/README.md b/docserver/README.md
index 277763f..37b3ee1 100644
--- a/docserver/README.md
+++ b/docserver/README.md
@@ -61,6 +61,32 @@ This will:
 
 The worker must run as a **logged-in user** — Word COM requires an interactive Windows session and will fail under a system service account.
 
+## Reliability / self-healing
+
+The worker is designed to recover from outages without manual intervention:
+
+- **MinIO outages don't kill it.** The worker retries the MinIO connection
+  indefinitely with capped exponential backoff (5s → 120s) instead of exiting,
+  and each poll cycle is wrapped so a transient network error / 502 just
+  rebuilds the client and keeps going. (Previously a single 502 made the worker
+  `sys.exit(1)`, leaving it dead until a reboot.)
+- **Crashes / kills are auto-recovered by Task Scheduler.** The
+  `PW-DocserverWorker` task has:
+  - `RestartCount=99`, `RestartInterval=1 min` — relaunch if the action fails,
+  - **two triggers**: `AtStartup` plus a **repeating trigger every 5 minutes**
+    with `MultipleInstances=IgnoreNew`, so if the process ever dies (crash,
+    manual kill, or a missed boot trigger) it relaunches within ~5 min and
+    never runs more than one instance,
+  - `StartWhenAvailable` to catch up a missed trigger.
+- `start_worker.bat` **propagates Python's exit code** (`exit /b %rc%`) so
+  Scheduler can actually detect a failed run.
+
+To re-apply these task settings on an existing install, run as Administrator:
+
+```powershell
+powershell -ExecutionPolicy Bypass -File C:\docserver\reconfigure_task.ps1
+```
+
 ## How to access MinIO externally
 
 The Windows VM needs to reach MinIO. Options:
diff --git a/docserver/install.ps1 b/docserver/install.ps1
index 3124142..1985c33 100644
--- a/docserver/install.ps1
+++ b/docserver/install.ps1
@@ -166,11 +166,18 @@ $action = New-ScheduledTaskAction `
     -Argument "/c `"$AppDir\start_worker.bat`"" `
     -WorkingDirectory $AppDir
 
-$trigger = New-ScheduledTaskTrigger -AtStartup
+# Two triggers for self-healing: at boot, plus a repeating 5-minute safety net
+# that relaunches the worker if its process ever dies (crash, manual kill, or a
+# missed boot trigger). MultipleInstances=IgnoreNew keeps it to one instance.
+$atStartup = New-ScheduledTaskTrigger -AtStartup
+$repeat = New-ScheduledTaskTrigger -Once -At (Get-Date) `
+    -RepetitionInterval (New-TimeSpan -Minutes 5)
+try { $repeat.Repetition.Duration = 'P3650D' } catch {} # some builds need an explicit long duration
+$trigger = @($atStartup, $repeat)
 
 $settings = New-ScheduledTaskSettingsSet `
     -ExecutionTimeLimit (New-TimeSpan -Hours 0) `
-    -RestartCount 10 `
+    -RestartCount 99 `
     -RestartInterval (New-TimeSpan -Minutes 1) `
     -StartWhenAvailable `
     -MultipleInstances IgnoreNew `
diff --git a/docserver/reconfigure_task.ps1 b/docserver/reconfigure_task.ps1
new file mode 100644
index 0000000..c4fc812
--- /dev/null
+++ b/docserver/reconfigure_task.ps1
@@ -0,0 +1,46 @@
+# Reconfigures the PW-DocserverWorker scheduled task for self-healing:
+#   - restart up to 99x at 1-min intervals if the task action fails
+#   - StartWhenAvailable (catch up if a trigger was missed)
+#   - a repeating safety trigger every 5 min with MultipleInstances=IgnoreNew,
+#     so if the worker process ever dies (crash, manual kill, missed boot
+#     trigger) it relaunches within ~5 min instead of waiting for a reboot
+#   - keeps AtStartup + SYSTEM/Highest (current working config)
+# Idempotent: safe to re-run. Run as Administrator.
+$ErrorActionPreference = 'Stop'
+$taskName = 'PW-DocserverWorker'
+$appDir = 'C:\docserver'
+
+$action = New-ScheduledTaskAction -Execute 'cmd.exe' `
+    -Argument "/c `"$appDir\start_worker.bat`"" -WorkingDirectory $appDir
+
+# Two triggers: at boot, and a repeating safety net every 5 minutes (indefinitely).
+$atStartup = New-ScheduledTaskTrigger -AtStartup
+$repeat = New-ScheduledTaskTrigger -Once -At (Get-Date) `
+    -RepetitionInterval (New-TimeSpan -Minutes 5)
+# Some Windows builds cap repetition without an explicit long duration; set ~10y.
+try { $repeat.Repetition.Duration = 'P3650D' } catch {}
+
+$settings = New-ScheduledTaskSettingsSet `
+    -ExecutionTimeLimit (New-TimeSpan -Hours 0) `
+    -RestartCount 99 `
+    -RestartInterval (New-TimeSpan -Minutes 1) `
+    -StartWhenAvailable `
+    -MultipleInstances IgnoreNew `
+    -AllowStartIfOnBatteries `
+    -DontStopIfGoingOnBatteries
+
+$principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' `
+    -LogonType ServiceAccount -RunLevel Highest
+
+Register-ScheduledTask -TaskName $taskName -Action $action `
+    -Trigger @($atStartup, $repeat) -Settings $settings -Principal $principal `
+    -Description 'Performance West DOCX-to-PDF worker (MinIO + Word COM). Self-healing: restarts on failure + 5-min safety trigger.' `
+    -Force | Out-Null
+
+Write-Host "Reconfigured ${taskName}:"
+$ti = Get-ScheduledTask -TaskName $taskName
+$ti.Triggers | ForEach-Object { Write-Host (" trigger: " + $_.CimClass.CimClassName) }
+$s = $ti.Settings
+Write-Host (" RestartCount=" + $s.RestartCount + " RestartInterval=" + $s.RestartInterval +
+    " StartWhenAvailable=" + $s.StartWhenAvailable + " MultipleInstances=" + $s.MultipleInstances)
+Write-Host (" State=" + $ti.State)
diff --git a/docserver/start_worker.bat b/docserver/start_worker.bat
new file mode 100644
index 0000000..4dcf365
--- /dev/null
+++ b/docserver/start_worker.bat
@@ -0,0 +1,17 @@
+@echo off
+setlocal enabledelayedexpansion
+cd /d C:\docserver
+
+echo [%date% %time%] Starting Performance West Docserver Worker...
+
+for /f "usebackq tokens=1,* delims==" %%a in ("C:\docserver\docserver.env") do (
+    set "ln=%%a"
+    if not "!ln:~0,1!"=="#" (
+        if not "%%a"=="" set "%%a=%%b"
+    )
+)
+
+C:\Python313\python.exe C:\docserver\docserver_worker.py
+set "rc=%errorlevel%"
+echo [%date% %time%] Worker exited with code %rc%.
+endlocal & exit /b %rc%
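
Note on the retry behaviour: `docserver_worker.py` itself is not part of this diff, so the "capped exponential backoff (5s → 120s)" described in `docserver/README.md` is only documented here, not shown. The following is a minimal Python sketch of that pattern, not the worker's actual code. The names `backoff_delays` and `connect_with_retry` are hypothetical, the doubling factor is an assumption (the README states only the 5s start and the 120s cap), and `connect` stands in for whatever callable builds the MinIO client.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(initial: float = 5.0, cap: float = 120.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield retry delays forever: initial, initial*factor, ... capped at `cap`.

    The 5s start and 120s cap come from the README; doubling is an assumption.
    """
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_retry(connect: Callable[[], T],
                       sleep: Callable[[float], None] = time.sleep,
                       log: Callable[[str], None] = print) -> T:
    """Call connect() until it succeeds instead of exiting on the first error.

    `connect` is a hypothetical stand-in for the code that creates the MinIO
    client; `sleep` and `log` are injectable so the loop can be tested.
    """
    delays = backoff_delays()
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except Exception as exc:  # broad on purpose: any failure is retried
            delay = next(delays)
            log(f"MinIO connect attempt {attempt} failed ({exc}); "
                f"retrying in {delay:.0f}s")
            sleep(delay)
```

With these defaults the waits run 5, 10, 20, 40, 80 seconds and then stay at 120, so a long outage costs at most one connection attempt every two minutes, and the process never returns a non-zero exit code just because MinIO was briefly unreachable.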