new-site/docs/document-generation.md
justin b48d0cb799 docserver: self-healing Task Scheduler config + docs
Companion to the worker MinIO-retry fix. Makes the worker auto-recover from
process death (crash, manual kill, missed boot trigger), not just MinIO outages.

- start_worker.bat: propagate Python's exit code (exit /b %rc%) so Task
  Scheduler can actually detect a failed run (it previously always exited 0).
- reconfigure_task.ps1 (new): re-registers PW-DocserverWorker with
  RestartCount=99 / 1-min interval, StartWhenAvailable, and two triggers —
  AtStartup plus a 5-min repeating trigger with MultipleInstances=IgnoreNew, so
  a dead worker relaunches within ~5 min and never double-runs. Idempotent.
- install.ps1: same self-healing settings for fresh installs.
- Verified on the box: killed the worker -> task relaunched it; firing again
  while running stayed at one instance.

Docs updated to match reality:
- docserver/README.md: new 'Reliability / self-healing' section.
- document-generation.md: corrected the stale 'Flask DocServer :5050 / HTTP'
  description to the actual MinIO outbound-only transport.
- e2e-test-plan.md: removed the outdated 'Word COM fails under SYSTEM / requires
  RDP after every reboot' limitation; now self-healing under SYSTEM session 0.
- infrastructure.md: fixed VM spec (Win Server 2019, Word 16.0, Python 3.13,
  SSH port 22422) + self-healing note.
- architecture.md / formation-system.md: trigger + self-healing details.
2026-06-15 22:49:21 -05:00

Performance West — Document Generation System

Last updated: 2026-03-27

Overview

The document generation system produces professional compliance documents for customers. It supports two generation modes:

  1. Template-based — DOCX templates with Jinja2 placeholders, filled with order data
  2. LLM-based — Templates provide structure; Ollama generates analysis sections

All generated documents pass through a quality gate (admin review) before delivery.

Architecture

                    ┌─────────────┐
                    │   ERPNext   │  (order data + intake forms)
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │   Worker    │  (Python — polls for Queued orders)
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │                         │
     ┌────────┴────────┐     ┌─────────┴─────────┐
     │  Template-based │     │    LLM-based      │
     │  (DocxBuilder)  │     │  (DocxBuilder +   │
     │                 │     │   Ollama/LLM)     │
     └────────┬────────┘     └─────────┬─────────┘
              │                         │
              └────────────┬────────────┘
                           │
                    ┌──────┴──────┐
                    │ PDF Convert │
                    │ ┌─────────┐ │
                    │ │DocServer│ │  ← PRIMARY (Windows, MS Word COM, via MinIO)
                    │ │ (MinIO) │ │
                    │ └────┬────┘ │
                    │      │ fail │
                    │ ┌────┴────┐ │
                    │ │LibreOfc │ │  ← FALLBACK (headless, in Docker)
                    │ └─────────┘ │
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │    MinIO    │  (upload DOCX + PDF)
                    └──────┬──────┘
                           │
                    ┌──────┴──────┐
                    │   ERPNext   │  (update status → Review)
                    └─────────────┘

Template-Based Generation

When Used

  • Operating agreements (formation orders)
  • Privacy policies
  • Invoices
  • CRTC registration letter (Canada CRTC Carrier Package)
  • BC corporate binder (9 sections — cover page, incorporation certificate placeholder, articles of incorporation, registered office, directors/officers, share structure, CRTC registration, vendor directory, compliance calendar)
  • Vendor directory PDF (Canadian telecom vendors and contacts)
  • Any document where the content is deterministic (no analysis needed)

How It Works

  1. Worker fetches the .docx template from MinIO (templates/{template-name}.docx)
  2. DocxBuilder loads the template via python-docx
  3. Variables from the ERPNext order are substituted into Jinja2 placeholders
  4. The filled document is saved as DOCX
  5. The DOCX is converted to PDF (Windows DocServer first, LibreOffice as fallback; see PDF Conversion)
  6. Both files are uploaded to MinIO

DOCX Template Format

Templates are standard .docx files with Jinja2 syntax embedded in the text:

Simple variables:

This Operating Agreement of {{ entity_name }}, a limited liability company
organized under the laws of {{ state_name }}...

Conditionals:

{% if management_type == 'manager' %}
The Manager(s) of the Company shall be {{ managers }}.
{% else %}
All Members shall have the authority to manage the business.
{% endif %}

Loops (for tables or repeated sections):

{% for member in members %}
{{ member.name }} — {{ member.ownership_pct }}% ownership
{% endfor %}

Section placeholders (for LLM-generated content):

{{ executive_summary }}
{{ classification_analysis }}
{{ remediation_plan }}

Creating a New Template

  1. Run python scripts/templates/create_templates.py to generate the base templates, or create manually in Word/LibreOffice
  2. Use {{ variable_name }} for all dynamic content
  3. Use Times New Roman for body text, navy blue (#2D4E78) for headings
  4. Include the Performance West header, confidentiality footer, and page numbers
  5. Save as .docx (not .doc)
  6. Upload to MinIO: mc cp template.docx minio/performancewest/templates/

Modifying an Existing Template

  1. Download from MinIO: mc cp minio/performancewest/templates/name.docx .
  2. Edit in Word or LibreOffice — preserve all {{ }} placeholders
  3. Test locally: python -c "from scripts.document_gen.docx_builder import DocxBuilder; ..."
  4. Upload the updated template back to MinIO
  5. Existing generated documents are not affected (they are separate files)
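
A quick way to confirm that an edit preserved every {{ }} placeholder is to compare the variable names before and after. The following is a minimal sketch, not part of the repository: it assumes the template text has already been extracted from the .docx into plain strings, and the function names are hypothetical. Word can split a placeholder across formatting runs, so the extraction step needs to join run text first.

```python
import re

# Matches Jinja2 variable placeholders such as {{ entity_name }} or {{ member.name }}.
PLACEHOLDER_RE = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def placeholders(text: str) -> set[str]:
    """Return the set of variable names used in {{ ... }} placeholders."""
    return set(PLACEHOLDER_RE.findall(text))


def missing_placeholders(original_text: str, edited_text: str) -> set[str]:
    """Names present in the original template text but absent after editing."""
    return placeholders(original_text) - placeholders(edited_text)


original = "Agreement of {{ entity_name }}, organized in {{ state_name }}."
edited = "Agreement of {{ entity_name }}, organized in the state below."
print(sorted(missing_placeholders(original, edited)))
```

An empty result means no variable placeholder was lost; {% %} control tags are not covered by this check.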

LLM-Based Generation

When Used

  • FLSA/wage & hour audit reports
  • CCPA/CPRA compliance audit reports
  • TCPA consent audit reports
  • Independent contractor classification assessments
  • Employee handbook reviews
  • Data breach response plans

How It Works

  1. Worker fetches the DOCX template (provides structure and formatting)
  2. Worker constructs a prompt from the service-specific handler + intake data
  3. Worker sends the prompt to Ollama (qwen2.5:7b running locally)
  4. LLM returns analysis text for each section
  5. DocxBuilder.insert_section() replaces section placeholders with LLM output
  6. Simple variables (company name, dates) are filled via DocxBuilder.fill()
  7. Document is converted to PDF and uploaded to MinIO
  8. Status is always set to Review — LLM output must be human-reviewed

Prompt Engineering Guidelines

Each compliance service has a dedicated handler in scripts/workers/services/ that constructs the prompt. Follow these guidelines:

Structure:

You are a compliance consultant preparing a {document_type} for {company_name}.

CONTEXT:
{intake_data formatted as structured text}

INSTRUCTIONS:
- Write in a professional, objective tone
- Cite specific regulations by name and section number
- Identify concrete findings (compliant, non-compliant, needs improvement)
- Provide actionable remediation steps with deadlines
- Do not include legal advice disclaimers (the template adds these)

OUTPUT FORMAT:
Return a JSON object with the following keys:
- executive_summary: 2-3 paragraph overview
- {section_name}: detailed analysis for each section
- remediation_plan: prioritized action items

Write for a business audience. Be specific, not generic.

Key rules:

  • Always request JSON output — easier to parse and insert into template sections
  • Include the intake data as structured context, not raw form dumps
  • Specify the exact section names that match template placeholders
  • Set temperature to 0.3 for consistency; compliance documents should not be creative
  • Cap output at 4096 tokens per section to prevent rambling
  • If the LLM returns malformed JSON, retry once with a stricter prompt
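
The JSON and retry rules above can be sketched as follows. This is an illustration under assumptions rather than the code in llm_writer.py: generate stands in for whatever function sends a prompt to the model and returns its raw text, and the wording of the stricter retry prompt is invented for the example.

```python
import json
from typing import Callable


def parse_sections(raw: str, required: list[str]) -> dict[str, str]:
    """Parse the model reply as JSON and check that every template section is present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return {key: str(data[key]) for key in required}


def generate_sections(
    generate: Callable[[str], str], prompt: str, required: list[str]
) -> dict[str, str]:
    """Call the model, retrying once with a stricter prompt if the reply is unusable."""
    try:
        return parse_sections(generate(prompt), required)
    except ValueError:
        strict = prompt + "\n\nReturn ONLY a valid JSON object with keys: " + ", ".join(required)
        return parse_sections(generate(strict), required)


# Fake model for demonstration: the first reply is chatty text, the second is valid JSON.
replies = iter([
    "Sure! Here is the report...",
    '{"executive_summary": "ok", "remediation_plan": "fix"}',
])
sections = generate_sections(
    lambda _prompt: next(replies),
    "Prepare an audit.",
    ["executive_summary", "remediation_plan"],
)
print(sections)
```

Validating the key names at parse time catches a mismatch with the template placeholders before anything is inserted into the document. A second failure propagates to the caller.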

Model selection:

  • Default: qwen2.5:7b (good balance of quality and speed for 16GB VRAM)
  • For complex multi-state analysis: qwen2.5:14b if GPU memory allows
  • Configured via OLLAMA_MODEL environment variable
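
As a rough illustration of how these settings could map onto an Ollama /api/generate request body, the sketch below builds the payload without sending it. The function name is hypothetical and the actual request code in llm_writer.py is not shown in this document. Note that num_predict caps the tokens of one response, so a per-section limit assumes one request per section.

```python
import os


def build_ollama_request(prompt: str) -> dict:
    """Build a request body for Ollama's /api/generate endpoint (not sent here)."""
    model = os.environ.get("OLLAMA_MODEL", "qwen2.5:7b")  # default from this document
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,       # return one complete response instead of chunks
        "format": "json",      # ask Ollama to constrain the output to JSON
        "options": {
            "temperature": 0.3,   # low temperature for consistent output
            "num_predict": 4096,  # upper bound on generated tokens
        },
    }


payload = build_ollama_request("Prepare the executive summary.")
print(payload["model"], payload["options"])
```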

PDF Conversion

DOCX to PDF conversion uses a two-tier approach:

PRIMARY: Windows DocServer (Microsoft Word COM)

A Windows server runs docserver_worker.py, which uses Microsoft Word via COM automation for pixel-perfect DOCX → PDF conversion. This produces the highest-fidelity output (exact font rendering, correct page breaks, proper table formatting).

The transport is MinIO, not HTTP. The Windows VM makes only outbound connections to MinIO, so it needs no open inbound ports or SSH tunnels and works behind any NAT:

pdf_converter.py (Linux)                MinIO (S3)            docserver_worker.py (Windows)
  PUT docx → to-convert/{id}.docx ─────────►                          │
                                            │◄─ poll every 12s ───────┤
                                            │                         ├─ Word.SaveAs → PDF
  GET pdf  ← converted/{id}.pdf  ◄──────────│◄─ PUT converted/{id}.pdf┘
  DEL docx / DEL pdf (cleanup)

# pdf_converter.py: primary path (simplified)
mc.put_object(bucket, f"to-convert/{job_id}.docx", docx_stream, length)
# ...poll until converted/{job_id}.pdf appears (DOCSERVER_TIMEOUT, default 120s)...
pdf_bytes = mc.get_object(bucket, f"converted/{job_id}.pdf").read()
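
The elided polling step might look like the sketch below. It is written against an injected fetch callable rather than a real MinIO client so that it runs standalone; the helper name, the 2-second poll interval, and the convention that fetch returns None while the object is missing are all assumptions, not details taken from pdf_converter.py.

```python
import time
from typing import Callable, Optional


def wait_for_object(
    fetch: Callable[[str], Optional[bytes]],
    key: str,
    timeout: float = 120.0,   # mirrors the DOCSERVER_TIMEOUT default
    interval: float = 2.0,    # assumed poll interval on the Linux side
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Poll until fetch(key) returns data, or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        data = fetch(key)
        if data is not None:
            return data
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{key} did not appear within {timeout}s")
        sleep(interval)


# Fake store for demonstration: the PDF "appears" on the third poll.
calls = {"n": 0}

def fake_fetch(key: str) -> Optional[bytes]:
    calls["n"] += 1
    return b"%PDF-1.7" if calls["n"] >= 3 else None

pdf = wait_for_object(fake_fetch, "converted/job-1.pdf", timeout=5, interval=0, sleep=lambda _s: None)
print(len(pdf), calls["n"])
```

A TimeoutError from this loop is the natural signal for the caller to switch to the LibreOffice fallback.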

The Windows worker is self-healing: it retries MinIO with backoff instead of exiting on a transient outage, and its PW-DocserverWorker scheduled task restarts on failure plus re-fires every 5 minutes if the process dies. See docserver/README.md → "Reliability / self-healing".
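
Retrying with backoff can be sketched as below. The schedule (start at 1s, double each time, cap at 60s) and the choice of ConnectionError as the transient failure are assumptions made for illustration; the schedule actually used by docserver_worker.py is not stated here.

```python
import itertools
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield an endless series of delays that grows geometrically up to `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def call_with_retry(
    op: Callable[[], T],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(); on ConnectionError wait for the next delay and try again."""
    for delay in delays:
        try:
            return op()
        except ConnectionError:
            sleep(delay)
    return op()  # delays exhausted (finite iterable only): final attempt, errors propagate


# Demonstration: the operation fails twice, then succeeds.
attempts = []

def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("MinIO unreachable")
    return "ok"

slept: list[float] = []
result = call_with_retry(flaky, backoff_delays(), sleep=slept.append)
print(result, slept)
print(list(itertools.islice(backoff_delays(), 8)))
```

With the endless generator the worker never gives up on a transient outage, which matches the intent of retrying instead of exiting.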

FALLBACK: LibreOffice Headless

If DocServer is unavailable (network error, timeout, Windows server down), the converter falls back to LibreOffice in headless mode:

libreoffice --headless --convert-to pdf --outdir /tmp document.docx
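
From Python, the same command could be wrapped roughly as follows. The function names and the 120-second timeout are hypothetical; the sketch relies on LibreOffice writing the output into --outdir under the input file's base name with a .pdf extension.

```python
import subprocess
from pathlib import Path


def libreoffice_command(docx_path: str, outdir: str) -> list[str]:
    """Build the argv list for a headless DOCX to PDF conversion."""
    return [
        "libreoffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        str(docx_path),
    ]


def expected_pdf_path(docx_path: str, outdir: str) -> Path:
    """Path where the converted file is expected: <outdir>/<input stem>.pdf."""
    return Path(outdir) / (Path(docx_path).stem + ".pdf")


def convert_with_libreoffice(docx_path: str, outdir: str, timeout: float = 120.0) -> Path:
    """Run the conversion; raises CalledProcessError or TimeoutExpired on failure."""
    subprocess.run(libreoffice_command(docx_path, outdir), check=True, timeout=timeout)
    return expected_pdf_path(docx_path, outdir)


print(libreoffice_command("document.docx", "/tmp"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.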

Converter Logic

The pdf_converter.py module handles:

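  • DocServer first: PUT the DOCX to to-convert/{id}.docx in MinIO, then poll for converted/{id}.pdf until DOCSERVER_TIMEOUT (default 120s)
  • Fallback to LibreOffice if the PDF does not appear before the timeout or the MinIO request fails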
  • Retry logic (up to 3 attempts per converter)
  • Temporary file cleanup
  • Error reporting to ERPNext
  • Logs which converter was used for each document
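
The ordering, retry, and logging behavior listed above can be sketched with injected converter functions. This is a simplified illustration, not the pdf_converter.py implementation: the converter names, the function signature, and the use of print for logging are assumptions.

```python
from typing import Callable, Sequence, Tuple

Converter = Callable[[bytes], bytes]


def convert_with_fallback(
    docx: bytes,
    converters: Sequence[Tuple[str, Converter]],
    attempts: int = 3,
    log: Callable[[str], None] = print,
) -> Tuple[str, bytes]:
    """Try each converter in order, up to `attempts` times each; return (name, pdf)."""
    errors = []
    for name, convert in converters:
        for attempt in range(1, attempts + 1):
            try:
                pdf = convert(docx)
                log(f"converted with {name} (attempt {attempt})")
                return name, pdf
            except Exception as exc:  # any failure counts against this converter
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("all converters failed: " + "; ".join(errors))


# Demonstration: the primary always times out, so the fallback is used.
docserver_calls = []

def fake_docserver(docx: bytes) -> bytes:
    docserver_calls.append(1)
    raise TimeoutError("no PDF before timeout")

def fake_libreoffice(docx: bytes) -> bytes:
    return b"%PDF-fallback"

used, pdf_bytes = convert_with_fallback(
    b"docx-bytes",
    [("docserver", fake_docserver), ("libreoffice", fake_libreoffice)],
)
print(used, len(docserver_calls))
```

The collected error list is what would be reported to ERPNext if every converter fails.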

LibreOffice is installed in the Python worker Docker container (scripts/Dockerfile). DocServer host is configured via DOCSERVER_HOST environment variable (default: 192.168.1.x).

MinIO Upload/Download

The minio_client.py module provides:

# Upload a generated document
upload_document(
    local_path="/tmp/operating-agreement.pdf",
    minio_path="orders/FO-2026-0001/operating-agreement.pdf",
    content_type="application/pdf",
)

# Download a template
download_template(
    template_name="operating-agreement",  # downloads operating-agreement.docx
    local_path="/tmp/operating-agreement.docx",
)

# Generate a pre-signed URL for customer download
url = presign_url(
    minio_path="orders/FO-2026-0001/operating-agreement.pdf",
    expires=3600,  # 1 hour
)

Bucket structure: See docs/crm.md for the full MinIO directory layout.

Security: MinIO is not exposed externally. The Express API generates time-limited pre-signed URLs for customer downloads.

Quality Gates

Admin Review

Every generated document enters Review status before delivery:

  1. Admin opens the order in ERPNext
  2. Downloads the DOCX/PDF from the attached MinIO link
  3. Reviews for accuracy, completeness, and professionalism
  4. Actions:
    • Approve — moves to Ready
    • Request Revision — moves to Revision with notes; worker re-generates
    • Reject — flags for manual document creation

Revision Loop

When a reviewer requests changes:

  1. Order status returns to Processing
  2. Reviewer's notes are stored in the ERPNext order comments
  3. Worker re-generates with adjusted prompts or manual edits
  4. Document re-enters Review
  5. Maximum 3 automated revision cycles; after that, manual creation is required
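
The review actions and the revision cap can be modeled as a small transition function. This is a sketch under assumptions: the action names and the "Manual" status label are hypothetical stand-ins for however ERPNext records a rejection or an exhausted revision budget.

```python
MAX_REVISIONS = 3  # automated revision cycles allowed before manual creation


def next_status(status: str, action: str, revisions: int) -> tuple[str, int]:
    """Return (new_status, revision_count) for a reviewer action on an order in Review."""
    if status != "Review":
        raise ValueError(f"reviewer actions only apply in Review, not {status}")
    if action == "approve":
        return "Ready", revisions
    if action == "reject":
        return "Manual", revisions  # hypothetical label: flagged for manual creation
    if action == "request_revision":
        if revisions >= MAX_REVISIONS:
            return "Manual", revisions  # revision budget exhausted
        return "Processing", revisions + 1  # worker re-generates, then back to Review
    raise ValueError(f"unknown action: {action}")


print(next_status("Review", "request_revision", 0))
print(next_status("Review", "request_revision", 3))
```

Keeping the counter alongside the status makes the three-cycle limit enforceable in one place.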

File Reference

scripts/
├── document_gen/
│   ├── __init__.py
│   ├── docx_builder.py          # DOCX template filling (Jinja2 + python-docx)
│   ├── llm_writer.py            # Ollama prompt construction and parsing
│   ├── minio_client.py          # MinIO upload/download/presign
│   └── pdf_converter.py         # DOCX→PDF (DocServer via MinIO, LibreOffice fallback)
├── templates/
│   ├── create_templates.py      # Generates all .docx templates (run once)
│   ├── crtc-registration-letter.docx  # CRTC carrier registration letter template
│   ├── bc-corporate-binder.docx       # BC corporate binder (9 sections)
│   ├── vendor-directory.docx          # Canadian telecom vendor directory
│   └── *.docx                   # Other generated template files
└── workers/
    ├── base_worker.py           # ERPNext polling loop, status transitions
    ├── erpnext_client.py        # ERPNext REST API client
    ├── delivery_worker.py       # Email delivery with SMTP
    ├── renewal_worker.py        # Subscription renewal reminders
    └── services/
        ├── base_handler.py      # Base class for service handlers
        ├── privacy_policy.py    # Template-based: fill and convert
        ├── breach_response.py   # LLM: breach response plan
        ├── flsa_audit.py        # LLM: FLSA audit report
        ├── ccpa_audit.py        # LLM: CCPA audit report
        ├── consent_audit.py     # LLM: TCPA consent audit
        ├── contractor_review.py # LLM: contractor classification
        ├── handbook_review.py   # LLM: handbook review
        ├── campaign_review.py   # LLM: marketing campaign review
        └── dnc_review.py        # LLM: DNC compliance review