Companion to the worker MinIO-retry fix. Makes the worker auto-recover from process death (crash, manual kill, missed boot trigger), not just MinIO outages.

- start_worker.bat: propagate Python's exit code (exit /b %rc%) so Task Scheduler can actually detect a failed run (it previously always exited 0).
- reconfigure_task.ps1 (new): re-registers PW-DocserverWorker with RestartCount=99 / 1-min interval, StartWhenAvailable, and two triggers — AtStartup plus a 5-min repeating trigger with MultipleInstances=IgnoreNew, so a dead worker relaunches within ~5 min and never double-runs. Idempotent.
- install.ps1: same self-healing settings for fresh installs.
- Verified on the box: killed the worker -> task relaunched it; firing again while running stayed at one instance.

Docs updated to match reality:

- docserver/README.md: new 'Reliability / self-healing' section.
- document-generation.md: corrected the stale 'Flask DocServer :5050 / HTTP' description to the actual MinIO outbound-only transport.
- e2e-test-plan.md: removed the outdated 'Word COM fails under SYSTEM / requires RDP after every reboot' limitation; now self-healing under SYSTEM session 0.
- infrastructure.md: fixed VM spec (Win Server 2019, Word 16.0, Python 3.13, SSH port 22422) + self-healing note.
- architecture.md / formation-system.md: trigger + self-healing details.

# Performance West — Document Generation System

**Last updated:** 2026-03-27

## Overview

The document generation system produces professional compliance documents for customers. It supports two generation modes:

1. **Template-based** — DOCX templates with Jinja2 placeholders, filled with order data
2. **LLM-based** — Templates provide structure; Ollama generates analysis sections

All generated documents pass through a quality gate (admin review) before delivery.

## Architecture

```
                    ┌─────────────┐
                    │   ERPNext   │  (order data + intake forms)
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │   Worker    │  (Python — polls for Queued orders)
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │                         │
     ┌────────┴────────┐      ┌─────────┴─────────┐
     │ Template-based  │      │    LLM-based      │
     │ (DocxBuilder)   │      │ (DocxBuilder +    │
     │                 │      │  Ollama/LLM)      │
     └────────┬────────┘      └─────────┬─────────┘
              │                         │
              └────────────┬────────────┘
                           │
                    ┌──────┴──────┐
                    │ PDF Convert │
                    │ ┌─────────┐ │
                    │ │DocServer│ │  ← PRIMARY (Windows, MS Word COM, via MinIO)
                    │ │via MinIO│ │
                    │ └────┬────┘ │
                    │      │ fail │
                    │ ┌────┴────┐ │
                    │ │LibreOfc │ │  ← FALLBACK (headless, in Docker)
                    │ └─────────┘ │
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │    MinIO    │  (upload DOCX + PDF)
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │   ERPNext   │  (update status → Review)
                    └─────────────┘
```

## Template-Based Generation

### When Used

- Operating agreements (formation orders)
- Privacy policies
- Invoices
- CRTC registration letter (Canada CRTC Carrier Package)
- BC corporate binder (9 sections — cover page, incorporation certificate placeholder,
  articles of incorporation, registered office, directors/officers, share structure,
  CRTC registration, vendor directory, compliance calendar)
- Vendor directory PDF (Canadian telecom vendors and contacts)
- Any document where the content is deterministic (no analysis needed)

### How It Works

1. Worker fetches the `.docx` template from MinIO (`templates/{template-name}.docx`)
2. `DocxBuilder` loads the template via `python-docx`
3. Variables from the ERPNext order are substituted into Jinja2 placeholders
4. The filled document is saved as DOCX
5. The DOCX is converted to PDF (Windows DocServer first, LibreOffice as fallback; see "PDF Conversion")
6. Both files are uploaded to MinIO

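As a rough illustration of step 3, the sketch below substitutes simple `{{ name }}` placeholders in plain text with a regular expression. It is a stand-in only: the real `DocxBuilder` works on DOCX files with Jinja2 and `python-docx`, and `fill_placeholders` and the sample order values are hypothetical names chosen for this example.

```python
import re

# Matches simple "{{ name }}" placeholders (no filters, loops, or conditionals).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_placeholders(text: str, context: dict) -> str:
    """Replace each {{ name }} with context[name]; raise if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in context:
            # Failing loudly avoids shipping a document with a blank field.
            raise KeyError(f"no value for placeholder {name!r}")
        return str(context[name])

    return PLACEHOLDER.sub(substitute, text)


# Hypothetical order data; real values come from the ERPNext order.
order = {"entity_name": "Acme Holdings LLC", "state_name": "Wyoming"}
line = (
    "This Operating Agreement of {{ entity_name }} "
    "is organized under the laws of {{ state_name }}."
)
filled = fill_placeholders(line, order)
```

Raising on a missing value, rather than leaving the placeholder blank, is a design choice of this sketch; it keeps incomplete orders from reaching the review queue silently.
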
### DOCX Template Format

Templates are standard `.docx` files with Jinja2 syntax embedded in the text:

**Simple variables:**
```
This Operating Agreement of {{ entity_name }}, a limited liability company
organized under the laws of {{ state_name }}...
```

**Conditionals:**
```
{% if management_type == 'manager' %}
The Manager(s) of the Company shall be {{ managers }}.
{% else %}
All Members shall have the authority to manage the business.
{% endif %}
```

**Loops (for tables or repeated sections):**
```
{% for member in members %}
{{ member.name }} — {{ member.ownership_pct }}% ownership
{% endfor %}
```

**Section placeholders (for LLM-generated content):**
```
{{ executive_summary }}
{{ classification_analysis }}
{{ remediation_plan }}
```

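Before rendering, it can help to check that the order data covers every variable a template refers to. The sketch below is a regex heuristic under stated assumptions, not a Jinja2 parser: it only understands the three constructs shown above, and `required_variables` and `missing_variables` are hypothetical helper names.

```python
import re

EXPR = re.compile(r"\{\{\s*([A-Za-z_]\w*)")          # root name in {{ name }} or {{ name.attr }}
FOR = re.compile(r"\{%\s*for\s+(\w+)\s+in\s+(\w+)")  # {% for item in items %}
IF = re.compile(r"\{%\s*if\s+([A-Za-z_]\w*)")        # first name tested by {% if ... %}


def required_variables(template_text: str) -> set:
    """Names the rendering context must supply (loop variables excluded)."""
    loop_vars = {m.group(1) for m in FOR.finditer(template_text)}
    names = {m.group(1) for m in EXPR.finditer(template_text)}
    names |= {m.group(2) for m in FOR.finditer(template_text)}
    names |= {m.group(1) for m in IF.finditer(template_text)}
    return names - loop_vars


def missing_variables(template_text: str, context: dict) -> set:
    """Required names that the context does not provide."""
    return required_variables(template_text) - set(context)


template = """
{% if management_type == 'manager' %}
The Manager(s) of the Company shall be {{ managers }}.
{% endif %}
{% for member in members %}
{{ member.name }}: {{ member.ownership_pct }}% ownership
{% endfor %}
"""

# Hypothetical context with "managers" left out on purpose.
context = {
    "management_type": "manager",
    "members": [{"name": "A. Smith", "ownership_pct": 60}],
}
missing = missing_variables(template, context)
```

For a full parse, Jinja2 itself provides `jinja2.meta.find_undeclared_variables`, which handles filters, nested scopes, and other syntax this heuristic ignores.
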
### Creating a New Template

1. Run `python scripts/templates/create_templates.py` to generate the base templates, or create manually in Word/LibreOffice
2. Use `{{ variable_name }}` for all dynamic content
3. Use Times New Roman for body text, navy blue (`#2D4E78`) for headings
4. Include the Performance West header, confidentiality footer, and page numbers
5. Save as `.docx` (not `.doc`)
6. Upload to MinIO: `mc cp template.docx minio/performancewest/templates/`

### Modifying an Existing Template

1. Download from MinIO: `mc cp minio/performancewest/templates/name.docx .`
2. Edit in Word or LibreOffice — preserve all `{{ }}` placeholders
3. Test locally: `python -c "from scripts.document_gen.docx_builder import DocxBuilder; ..."`
4. Upload the updated template back to MinIO
5. Existing generated documents are not affected (they are separate files)

## LLM-Based Generation

### When Used

- FLSA/wage & hour audit reports
- CCPA/CPRA compliance audit reports
- TCPA consent audit reports
- Independent contractor classification assessments
- Employee handbook reviews
- Data breach response plans

### How It Works

1. Worker fetches the DOCX template (provides structure and formatting)
2. Worker constructs a prompt from the service-specific handler + intake data
3. Worker sends the prompt to Ollama (qwen2.5:7b running locally)
4. LLM returns analysis text for each section
5. `DocxBuilder.insert_section()` replaces section placeholders with LLM output
6. Simple variables (company name, dates) are filled via `DocxBuilder.fill()`
7. Document is converted to PDF and uploaded to MinIO
8. Status is always set to **Review** — LLM output must be human-reviewed

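Steps 3 and 4 might look roughly like the sketch below, which calls Ollama's `/api/generate` endpoint using only the standard library. The `OLLAMA_URL` variable, its default of `http://localhost:11434`, the 300-second timeout, and the function names are assumptions for illustration; the temperature and token values mirror the guidelines in this document, with the per-section limit mapped onto `num_predict` as an approximation. This is not necessarily how `llm_writer.py` is written.

```python
import json
import os
import urllib.request

# Assumed setting: this document does not name the Ollama URL.
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:7b")


def build_generate_request(prompt: str) -> dict:
    """Build the JSON body for a non-streaming, JSON-formatted completion."""
    return {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "format": "json",   # ask Ollama to constrain output to valid JSON
        "stream": False,    # return one response object instead of chunks
        "options": {"temperature": 0.3, "num_predict": 4096},
    }


def generate_sections(prompt: str, timeout: float = 300.0) -> dict:
    """Send the prompt to Ollama and return the parsed section dictionary."""
    body = json.dumps(build_generate_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        envelope = json.load(response)
    # The model output arrives as a string in the "response" field.
    return json.loads(envelope["response"])
```

The returned dictionary would then feed `DocxBuilder.insert_section()` one key at a time.
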
### Prompt Engineering Guidelines

Each compliance service has a dedicated handler in `scripts/workers/services/` that constructs the prompt. Follow these guidelines:

**Structure:**
```
You are a compliance consultant preparing a {document_type} for {company_name}.

CONTEXT:
{intake_data formatted as structured text}

INSTRUCTIONS:
- Write in a professional, objective tone
- Cite specific regulations by name and section number
- Identify concrete findings (compliant, non-compliant, needs improvement)
- Provide actionable remediation steps with deadlines
- Do not include legal advice disclaimers (the template adds these)

OUTPUT FORMAT:
Return a JSON object with the following keys:
- executive_summary: 2-3 paragraph overview
- {section_name}: detailed analysis for each section
- remediation_plan: prioritized action items

Write for a business audience. Be specific, not generic.
```

**Key rules:**
- Always request JSON output — easier to parse and insert into template sections
- Include the intake data as structured context, not raw form dumps
- Specify the exact section names that match template placeholders
- Set temperature to 0.3 for consistency; compliance documents should not be creative
- Maximum token limit: 4096 per section to prevent rambling
- If the LLM returns malformed JSON, retry once with a stricter prompt

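The last rule, a single retry with a stricter prompt, can be sketched as follows. The `generate` callable stands in for whatever sends a prompt to the model and returns raw text; treating missing section keys as a malformed reply is an assumption of this sketch, and both function names are hypothetical.

```python
import json
from typing import Callable


def parse_sections(raw: str, required: list) -> dict:
    """Parse the model reply and verify that every required section is present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return data


def generate_with_retry(
    generate: Callable[[str], str], prompt: str, required: list
) -> dict:
    """Try once; on a malformed reply, retry exactly once with a stricter prompt."""
    try:
        return parse_sections(generate(prompt), required)
    except ValueError:
        stricter = (
            prompt
            + "\n\nReturn ONLY a valid JSON object with exactly these keys: "
            + ", ".join(required)
        )
        # A second failure propagates so the order can be flagged for attention.
        return parse_sections(generate(stricter), required)
```

Passing the section names taken from the template into `required` ties the check to the rule that keys must match template placeholders.
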
**Model selection:**
- Default: `qwen2.5:7b` (good balance of quality and speed for 16GB VRAM)
- For complex multi-state analysis: `qwen2.5:14b` if GPU memory allows
- Configured via `OLLAMA_MODEL` environment variable

## PDF Conversion

DOCX to PDF conversion uses a two-tier approach:

### PRIMARY: Windows DocServer (Microsoft Word COM)

A Windows server runs `docserver_worker.py` that uses Microsoft Word via COM
automation for pixel-perfect DOCX → PDF conversion. This produces the
highest-fidelity output (exact font rendering, correct page breaks, proper table
formatting).

The transport is **MinIO, not HTTP** — the Windows VM only makes **outbound**
connections to MinIO, so there are no open inbound ports / SSH tunnels and it
works behind any NAT:

```text
pdf_converter.py (Linux)                MinIO (S3)          docserver_worker.py (Windows)
  PUT docx → to-convert/{id}.docx ─────────► │
                                             │◄─ poll every 12s ───────┤
                                             │                         ├─ Word.SaveAs → PDF
  GET pdf  ← converted/{id}.pdf ◄────────────│◄─ PUT converted/{id}.pdf┘
  DEL docx / DEL pdf  (cleanup)
```

```python
# pdf_converter.py — primary path (simplified)
mc.put_object(bucket, f"to-convert/{job_id}.docx", docx_stream, length)
# ...poll until converted/{job_id}.pdf appears (DOCSERVER_TIMEOUT, default 120s)...
pdf_bytes = mc.get_object(bucket, f"converted/{job_id}.pdf").read()
```

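The elided polling step could be sketched as below. This is an illustration, not the actual `pdf_converter.py` code: `wait_for_pdf` is a hypothetical name, the `fetch` callable stands in for a MinIO lookup that returns `None` while the object does not exist, and the 2-second poll interval is an assumption (only the 120-second default timeout comes from this document).

```python
import time
from typing import Callable, Optional


def wait_for_pdf(
    fetch: Callable[[str], Optional[bytes]],
    job_id: str,
    timeout: float = 120.0,   # mirrors the DOCSERVER_TIMEOUT default
    interval: float = 2.0,    # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Poll for converted/{job_id}.pdf until it appears or the timeout expires."""
    key = f"converted/{job_id}.pdf"
    deadline = clock() + timeout
    while True:
        data = fetch(key)
        if data is not None:
            return data
        if clock() >= deadline:
            # The caller can catch this and fall back to LibreOffice.
            raise TimeoutError(f"{key} did not appear within {timeout}s")
        sleep(interval)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; in production the defaults apply.
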
The Windows worker is **self-healing**: it retries MinIO with backoff instead of
exiting on a transient outage, and its `PW-DocserverWorker` scheduled task
restarts on failure plus re-fires every 5 minutes if the process dies. See
`docserver/README.md` → "Reliability / self-healing".

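A retry-with-backoff loop of the kind described above might look like this minimal sketch. The delay schedule (1 s doubling up to a 60 s cap), the use of `ConnectionError` as the transient failure type, and the function names are all assumptions; the real worker's values and exception types are documented in `docserver/README.md`, not here.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield an endless exponential delay sequence, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def call_with_backoff(
    operation: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 0,  # 0 means keep retrying forever, as a long-running worker would
) -> T:
    """Run `operation`, sleeping with growing delays after each transient failure."""
    attempt = 0
    for delay in backoff_delays():
        attempt += 1
        try:
            return operation()
        except ConnectionError:
            if max_attempts and attempt >= max_attempts:
                raise
            sleep(delay)
    raise RuntimeError("unreachable: backoff_delays() never ends")
```

Retrying in-process covers short outages, while the scheduled task covers the case where the process itself has died; the two mechanisms are complementary.
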
### FALLBACK: LibreOffice Headless

If DocServer is unavailable (network error, timeout, Windows server down), the converter
falls back to LibreOffice in headless mode:

```bash
libreoffice --headless --convert-to pdf --outdir /tmp document.docx
```

### Converter Logic

The `pdf_converter.py` module handles:
- **DocServer first** — uploads the DOCX to `to-convert/{id}.docx` in MinIO and polls for `converted/{id}.pdf` (`DOCSERVER_TIMEOUT`, default 120s)
- **Fallback to LibreOffice** — if the DocServer path fails or times out
- Retry logic (up to 3 attempts per converter)
- Temporary file cleanup
- Error reporting to ERPNext
- Logs which converter was used for each document

LibreOffice is installed in the Python worker Docker container (`scripts/Dockerfile`).
DocServer host is configured via `DOCSERVER_HOST` environment variable (default: `192.168.1.x`).

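The ordering and retry behaviour listed above can be sketched as a small driver that tries each converter in turn. The converter callables are placeholders for the DocServer and LibreOffice paths, and `convert_with_fallback` is a hypothetical name rather than the module's actual API.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

log = logging.getLogger("pdf_converter")

# A converter takes DOCX bytes and returns PDF bytes, raising on failure.
Converter = Callable[[bytes], bytes]


def convert_with_fallback(
    docx: bytes,
    converters: Sequence[Tuple[str, Converter]],
    attempts: int = 3,
) -> Tuple[str, bytes]:
    """Try each converter in order, up to `attempts` times each.

    Returns the name of the converter that succeeded and the PDF bytes.
    """
    last_error: Optional[Exception] = None
    for name, convert in converters:
        for attempt in range(1, attempts + 1):
            try:
                pdf = convert(docx)
            except Exception as exc:  # any failure counts against this converter
                last_error = exc
                log.warning("%s attempt %d/%d failed: %s", name, attempt, attempts, exc)
            else:
                log.info("converted with %s", name)
                return name, pdf
    raise RuntimeError("all converters failed") from last_error
```

A caller would pass something like `[("docserver", via_docserver), ("libreoffice", via_libreoffice)]`, where both functions are assumed wrappers around the two paths, and report the final `RuntimeError` to ERPNext.
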
## MinIO Upload/Download

The `minio_client.py` module provides:

```python
# Upload a generated document
upload_document(
    local_path="/tmp/operating-agreement.pdf",
    minio_path="orders/FO-2026-0001/operating-agreement.pdf",
    content_type="application/pdf",
)

# Download a template
download_template(
    template_name="operating-agreement",  # downloads operating-agreement.docx
    local_path="/tmp/operating-agreement.docx",
)

# Generate a pre-signed URL for customer download
url = presign_url(
    minio_path="orders/FO-2026-0001/operating-agreement.pdf",
    expires=3600,  # 1 hour
)
```

**Bucket structure:** See `docs/crm.md` for the full MinIO directory layout.

**Security:** MinIO is not exposed externally. The Express API generates time-limited pre-signed URLs for customer downloads.

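The object paths in the calls above follow a visible pattern, which the helpers below reproduce. They are hypothetical and inferred only from the examples in this section; the authoritative layout is the one in `docs/crm.md`.

```python
import mimetypes
from pathlib import PurePosixPath


def order_object_key(order_id: str, filename: str) -> str:
    """Build orders/{order_id}/{basename} from a local file path or name."""
    return str(PurePosixPath("orders") / order_id / PurePosixPath(filename).name)


def template_object_key(template_name: str) -> str:
    """Build templates/{template_name}.docx, matching the download example."""
    return f"templates/{template_name}.docx"


def guess_content_type(filename: str) -> str:
    """Guess a MIME type from the extension, with a generic binary default."""
    content_type, _ = mimetypes.guess_type(filename)
    return content_type or "application/octet-stream"


key = order_object_key("FO-2026-0001", "/tmp/operating-agreement.pdf")
```

Deriving keys in one place keeps upload, download, and pre-signing calls from drifting apart on path conventions.
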
## Quality Gates

### Admin Review

Every generated document enters **Review** status before delivery:

1. Admin opens the order in ERPNext
2. Downloads the DOCX/PDF from the attached MinIO link
3. Reviews for accuracy, completeness, and professionalism
4. Actions:
   - **Approve** — moves to Ready
   - **Request Revision** — moves to Revision with notes; worker re-generates
   - **Reject** — flags for manual document creation

### Revision Loop

When a reviewer requests changes:

1. Order status returns to **Processing**
2. Reviewer's notes are stored in the ERPNext order comments
3. Worker re-generates with adjusted prompts or manual edits
4. Document re-enters **Review**
5. Maximum 3 automated revision cycles; after that, manual creation is required

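The review actions and the three-cycle cap can be expressed as a small transition function. This is a sketch: the action strings and the `"Manual"` status label are hypothetical (the document only says the order is flagged for manual creation), and the real transitions live in ERPNext and `base_worker.py`.

```python
from typing import Tuple

MAX_REVISION_CYCLES = 3  # from the revision loop rule above


def apply_review_action(action: str, revision_cycles: int) -> Tuple[str, int]:
    """Return (next_status, revision_cycles) for a document currently in Review."""
    if action == "approve":
        return "Ready", revision_cycles
    if action == "reject":
        # "Manual" is a placeholder label for "flagged for manual creation".
        return "Manual", revision_cycles
    if action == "request_revision":
        if revision_cycles >= MAX_REVISION_CYCLES:
            # Automated cycles are exhausted; a person takes over.
            return "Manual", revision_cycles
        return "Revision", revision_cycles + 1
    raise ValueError(f"unknown review action: {action!r}")
```

Keeping the cycle count alongside the status makes the cap easy to enforce wherever the reviewer's action is processed.
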
## File Reference

```
scripts/
├── document_gen/
│   ├── __init__.py
│   ├── docx_builder.py                # DOCX template filling (Jinja2 + python-docx)
│   ├── llm_writer.py                  # Ollama prompt construction and parsing
│   ├── minio_client.py                # MinIO upload/download/presign
│   └── pdf_converter.py               # DOCX→PDF (DocServer via MinIO, LibreOffice headless fallback)
├── templates/
│   ├── create_templates.py            # Generates all .docx templates (run once)
│   ├── crtc-registration-letter.docx  # CRTC carrier registration letter template
│   ├── bc-corporate-binder.docx       # BC corporate binder (9 sections)
│   ├── vendor-directory.docx          # Canadian telecom vendor directory
│   └── *.docx                         # Other generated template files
└── workers/
    ├── base_worker.py                 # ERPNext polling loop, status transitions
    ├── erpnext_client.py              # ERPNext REST API client
    ├── delivery_worker.py             # Email delivery with SMTP
    ├── renewal_worker.py              # Subscription renewal reminders
    └── services/
        ├── base_handler.py            # Base class for service handlers
        ├── privacy_policy.py          # Template-based: fill and convert
        ├── breach_response.py         # LLM: breach response plan
        ├── flsa_audit.py              # LLM: FLSA audit report
        ├── ccpa_audit.py              # LLM: CCPA audit report
        ├── consent_audit.py           # LLM: TCPA consent audit
        ├── contractor_review.py       # LLM: contractor classification
        ├── handbook_review.py         # LLM: handbook review
        ├── campaign_review.py         # LLM: marketing campaign review
        └── dnc_review.py              # LLM: DNC compliance review
```