docserver: self-healing Task Scheduler config + docs

Companion to the worker MinIO-retry fix. Makes the worker auto-recover from
process death (crash, manual kill, missed boot trigger), not just MinIO outages.

- start_worker.bat: propagate Python's exit code (exit /b %rc%) so Task
  Scheduler can actually detect a failed run (it previously always exited 0).
- reconfigure_task.ps1 (new): re-registers PW-DocserverWorker with
  RestartCount=99 / 1-min interval, StartWhenAvailable, and two triggers —
  AtStartup plus a 5-min repeating trigger with MultipleInstances=IgnoreNew, so
  a dead worker relaunches within ~5 min and never double-runs. Idempotent.
- install.ps1: same self-healing settings for fresh installs.
- Verified on the box: killed the worker -> task relaunched it; firing again
  while running stayed at one instance.

Docs updated to match reality:
- docserver/README.md: new 'Reliability / self-healing' section.
- document-generation.md: corrected the stale 'Flask DocServer :5050 / HTTP'
  description to the actual MinIO outbound-only transport.
- e2e-test-plan.md: removed the outdated 'Word COM fails under SYSTEM / requires
  RDP after every reboot' limitation; now self-healing under SYSTEM session 0.
- infrastructure.md: fixed VM spec (Win Server 2019, Word 16.0, Python 3.13,
  SSH port 22422) + self-healing note.
- architecture.md / formation-system.md: trigger + self-healing details.
justin 2026-06-15 22:49:21 -05:00
parent 7929413eeb
commit b48d0cb799
9 changed files with 150 additions and 24 deletions


@@ -106,8 +106,8 @@ ERPNext custom Frappe apps (baked into `performancewest-erpnext:latest`):
│ │ Office 365 Word (COM automation) │ │
│ │ Python 3.13 + pywin32 + minio SDK │ │
│ │ docserver_worker.py (MinIO poller, 12s interval) │ │
│ │ Task Scheduler: PW-DocserverWorker (AtLogOn) │ │
│ │ Auto-logon configured (requires RDP after cold reboot) │ │
│ │ Task Scheduler: PW-DocserverWorker (AtStartup + 5-min) │ │
│ │ Self-healing: restart-on-fail + MinIO-retry (no RDP) │ │
│ │ Private network: 10.4.20.247 → MinIO via nginx │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
@@ -284,6 +284,10 @@ Workers: upload DOCX to minio://performancewest/to-convert/{uuid}.docx
- **Heartbeat:** DocServer writes `docserver-heartbeat.json` to MinIO every 60 seconds
- **Fallback:** If heartbeat is stale (>5 min), workers auto-switch to LibreOffice headless
- **Self-healing:** the worker retries MinIO with backoff instead of exiting on an
outage; the `PW-DocserverWorker` task restarts on failure (99×/1 min) and has a
5-min repeating trigger, so a crash/missed boot self-recovers without RDP. See
`docserver/README.md`.
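The heartbeat/fallback rule above can be sketched as a small decision function. This is a minimal illustration, assuming staleness is judged from the heartbeat object's last-modified time (the real workers may instead read a timestamp from inside `docserver-heartbeat.json`). The names `heartbeat_is_stale` and `choose_converter` are hypothetical and do not come from the codebase.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the "Fallback" bullet: stale after more than 5 minutes.
STALE_AFTER = timedelta(minutes=5)


def heartbeat_is_stale(last_modified, now=None):
    """Return True when the heartbeat is older than STALE_AFTER."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified > STALE_AFTER


def choose_converter(last_modified, now=None):
    """Pick a backend name (hypothetical helper).

    last_modified is None when no heartbeat object was found at all,
    which is treated the same as a stale heartbeat.
    """
    if last_modified is None or heartbeat_is_stale(last_modified, now):
        return "libreoffice"
    return "docserver"
```

With the worker writing every 60 seconds, a 5-minute threshold tolerates a few missed beats before conversions are rerouted.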
## Boot Sequence


@@ -194,20 +194,36 @@ DOCX to PDF conversion uses a two-tier approach:
### PRIMARY: Windows DocServer (Microsoft Word COM)
A Windows server runs a Flask-based DocServer at `:5050` that uses Microsoft Word via COM
automation for pixel-perfect DOCX → PDF conversion. This produces the highest-fidelity
output (exact font rendering, correct page breaks, proper table formatting).
A Windows server runs `docserver_worker.py` that uses Microsoft Word via COM
automation for pixel-perfect DOCX → PDF conversion. This produces the highest-
fidelity output (exact font rendering, correct page breaks, proper table
formatting).
The transport is **MinIO, not HTTP** — the Windows VM only makes **outbound**
connections to MinIO, so there are no open inbound ports / SSH tunnels and it
works behind any NAT:
```text
pdf_converter.py (Linux) MinIO (S3) docserver_worker.py (Windows)
PUT docx → to-convert/{id}.docx ─────────► │
│◄─ poll every 12s ───────┤
│ ├─ Word.SaveAs → PDF
GET pdf ← converted/{id}.pdf ◄──────────│◄─ PUT converted/{id}.pdf┘
DEL docx / DEL pdf (cleanup)
```
```python
# pdf_converter.py — primary path
response = requests.post(
f"http://{DOCSERVER_HOST}:5050/convert",
files={"file": open(docx_path, "rb")},
timeout=60,
)
pdf_bytes = response.content
# pdf_converter.py — primary path (simplified)
mc.put_object(bucket, f"to-convert/{job_id}.docx", docx_stream, length)
# ...poll until converted/{job_id}.pdf appears (DOCSERVER_TIMEOUT, default 120s)...
pdf_bytes = mc.get_object(bucket, f"converted/{job_id}.pdf").read()
```
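The elided polling step in the snippet above could look roughly like the following. It is a sketch under stated assumptions, not the actual `pdf_converter.py` code: `wait_for_pdf` and its `fetch` callable are hypothetical, and the 2-second poll interval is an illustrative value. Only the 120-second default mirrors `DOCSERVER_TIMEOUT`.

```python
import time


def wait_for_pdf(fetch, timeout=120.0, interval=2.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it returns bytes or the deadline passes.

    fetch is a caller-supplied callable (hypothetical) that returns the PDF
    bytes, or None while converted/{job_id}.pdf does not exist yet.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        pdf = fetch()
        if pdf is not None:
            return pdf
        if clock() >= deadline:
            # The caller can catch this and fall back to LibreOffice.
            raise TimeoutError(f"no PDF after {timeout:.0f}s")
        sleep(interval)
```

Using a monotonic clock for the deadline keeps the timeout correct even if the system clock is adjusted mid-wait.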
The Windows worker is **self-healing**: it retries MinIO with backoff instead of
exiting on a transient outage, and its `PW-DocserverWorker` scheduled task
restarts on failure plus re-fires every 5 minutes if the process dies. See
`docserver/README.md` → "Reliability / self-healing".
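As an illustration of "retries MinIO with backoff instead of exiting", a generic retry helper might look like this. It is a minimal sketch, assuming exponential backoff with a cap. The helper name `call_with_backoff`, the delay values, and the broad `except Exception` are all assumptions; the actual worker code is not shown in this commit view.

```python
import time


def call_with_backoff(op, base=1.0, cap=60.0, max_attempts=None,
                      sleep=time.sleep):
    """Retry op() on exception, doubling the delay up to cap.

    max_attempts=None retries forever, which suits a long-running poller
    that should outlive a transient outage rather than exit.
    """
    delay = base
    attempt = 0
    while True:
        attempt += 1
        try:
            return op()
        except Exception:
            # Real code would likely narrow this to network/S3 errors.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, cap)
```

Capping the delay bounds how long the worker can lag behind MinIO's recovery, and the scheduled task's restart settings remain the backstop if the process dies anyway.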
### FALLBACK: LibreOffice Headless
If DocServer is unavailable (network error, timeout, Windows server down), the converter


@@ -163,16 +163,25 @@ Write `scripts/tests/e2e_crtc_pipeline.py`:
### DocServer Investigation
Word COM fails under SYSTEM account and "Run whether user is logged on or not" mode.
Requires interactive desktop session (RDP login). Auto-logon configured (registry keys set)
but blocked by hosting provider's Windows Server 2019 policy.
As of 2026-06, the worker runs fine under the SYSTEM account in session 0 on
this Windows Server 2019 box (Word COM initialises and converts normally), so the
old "requires an interactive RDP login" workaround is no longer needed for normal
operation. It is **self-healing**:
**Workaround:** RDP into the VM once after reboot → AtLogOn trigger fires → Word COM works.
LibreOffice fallback handles conversions automatically when DocServer is unavailable.
- It retries the MinIO connection with backoff instead of `sys.exit(1)`, so a
transient MinIO 502 / outage no longer kills it (that was the cause of a
multi-week outage in May 2026).
- The `PW-DocserverWorker` task restarts on failure (99×/1 min) and has a
5-minute repeating safety trigger (`MultipleInstances=IgnoreNew`), so a crash
or missed boot trigger self-recovers within ~5 min without a reboot/RDP.
LibreOffice fallback still handles conversions automatically if DocServer is ever
unavailable. See `docserver/README.md` → "Reliability / self-healing".
### Known Limitations
1. **DocServer** — requires RDP login after cold reboot (auto-logon blocked by hosting provider)
1. **DocServer** — self-healing (auto-restarts on MinIO outage/crash); RDP only
needed if Word COM itself breaks (DCOM misconfig → run `fix_dcom.bat`)
2. **eSign JWT** — test uses different secret than dev API; falls back to PG simulation
3. **Compliance Calendar** — DocType not imported to ERPNext; 417 error on query
4. **ERPNext screenshots** — Playwright can't log into ERPNext from Docker (login page structure)


@@ -238,7 +238,7 @@ Flags for support conversations (escalation, priority, category).
- Converts via Word COM, drops PDF in `converted/` bucket
- Heartbeat file at `minio://performancewest/docserver-heartbeat.json` (60s interval)
- Atomic uploads via `.tmp_` prefix + `copy_object` rename
- Task Scheduler: `PW-DocserverWorker` - auto-restart on failure
- Task Scheduler: `PW-DocserverWorker` - self-healing: restarts on failure (99×/1 min) + AtStartup and a 5-min repeating trigger (relaunches within ~5 min if the process dies). The worker also retries MinIO on outage instead of exiting.
- **Fallback:** LibreOffice headless (`soffice --headless --convert-to pdf`) auto-activates when DocServer heartbeat stale (>5 min)
- **E2E tested:** 36KB DOCX → 82KB PDF in 12 seconds total round-trip


@@ -18,14 +18,15 @@
| Resource | Spec |
|----------|------|
| OS | Windows Server 2022 |
| IP / SSH | 108.181.102.34 (OpenSSH for Windows, **port 22422**) |
| OS | Windows Server 2019 (10.0.17763) |
| vCPU | 2 |
| RAM | 4 GB |
| Disk | 40 GB SSD |
| Software | Microsoft Office 2021 + Python 3.12 |
| Software | Microsoft Word (Office 16.0) + Python 3.13 |
| Service | docserver_worker.py (polls MinIO, converts via Word COM) |
Pixel-perfect DOCX→PDF conversion via Microsoft Word. Worker polls MinIO `to-convert/` bucket, converts via Word COM, uploads PDF to `converted/`. No HTTP server needed — MinIO is the transport. Requires RDP login after reboot (Word COM needs interactive session). LibreOffice headless is the automatic fallback.
Pixel-perfect DOCX→PDF conversion via Microsoft Word. Worker polls MinIO `to-convert/` bucket, converts via Word COM, uploads PDF to `converted/`. No HTTP server needed — MinIO is the transport (outbound-only; works behind any NAT). Runs under the SYSTEM account in session 0; **self-healing** — retries MinIO on outage instead of exiting, and the `PW-DocserverWorker` task restarts on failure + re-fires every 5 min if the process dies (no RDP needed for normal operation; RDP/`fix_dcom.bat` only if Word COM itself breaks). Heartbeat at `minio://performancewest/docserver-heartbeat.json` (60s). LibreOffice headless is the automatic fallback. Details: `docserver/README.md`.
## Email Servers