# Performance West — Document Generation System
**Last updated:** 2026-03-27
## Overview
The document generation system produces professional compliance documents for customers. It supports two generation modes:
1. **Template-based** — DOCX templates with Jinja2 placeholders, filled with order data
2. **LLM-based** — Templates provide structure; Ollama generates analysis sections
All generated documents pass through a quality gate (admin review) before delivery.
## Architecture
```
               ┌─────────────┐
               │   ERPNext   │  (order data + intake forms)
               └──────┬──────┘
               ┌──────┴──────┐
               │   Worker    │  (Python — polls for Queued orders)
               └──────┬──────┘
         ┌────────────┼────────────┐
         │                         │
┌────────┴────────┐      ┌─────────┴─────────┐
│ Template-based  │      │ LLM-based         │
│ (DocxBuilder)   │      │ (DocxBuilder +    │
│                 │      │  Ollama/LLM)      │
└────────┬────────┘      └─────────┬─────────┘
         │                         │
         └────────────┬────────────┘
               ┌──────┴──────┐
               │ PDF Convert │
               │ ┌─────────┐ │
               │ │DocServer│ │  ← PRIMARY (Windows, MS Word COM, :5050)
               │ │  :5050  │ │
               │ └────┬────┘ │
               │      │ fail │
               │ ┌────┴────┐ │
               │ │LibreOfc │ │  ← FALLBACK (headless, in Docker)
               │ └─────────┘ │
               └──────┬──────┘
               ┌──────┴──────┐
               │    MinIO    │  (upload DOCX + PDF)
               └──────┬──────┘
               ┌──────┴──────┐
               │   ERPNext   │  (update status → Review)
               └─────────────┘
```
## Template-Based Generation
### When Used
- Operating agreements (formation orders)
- Privacy policies
- Invoices
- CRTC registration letter (Canada CRTC Carrier Package)
- BC corporate binder (9 sections — cover page, incorporation certificate placeholder,
articles of incorporation, registered office, directors/officers, share structure,
CRTC registration, vendor directory, compliance calendar)
- Vendor directory PDF (Canadian telecom vendors and contacts)
- Any document where the content is deterministic (no analysis needed)
### How It Works
1. Worker fetches the `.docx` template from MinIO (`templates/{template-name}.docx`)
2. `DocxBuilder` loads the template via `python-docx`
3. Variables from the ERPNext order are substituted into Jinja2 placeholders
4. The filled document is saved as DOCX
5. `pdf_converter.py` converts the DOCX to PDF (DocServer first, LibreOffice as fallback; see PDF Conversion)
6. Both files are uploaded to MinIO
### DOCX Template Format
Templates are standard `.docx` files with Jinja2 syntax embedded in the text:
**Simple variables:**
```
This Operating Agreement of {{ entity_name }}, a limited liability company
organized under the laws of {{ state_name }}...
```
**Conditionals:**
```
{% if management_type == 'manager' %}
The Manager(s) of the Company shall be {{ managers }}.
{% else %}
All Members shall have the authority to manage the business.
{% endif %}
```
**Loops (for tables or repeated sections):**
```
{% for member in members %}
{{ member.name }} — {{ member.ownership_pct }}% ownership
{% endfor %}
```
**Section placeholders (for LLM-generated content):**
```
{{ executive_summary }}
{{ classification_analysis }}
{{ remediation_plan }}
```
### Creating a New Template
1. Run `python scripts/templates/create_templates.py` to generate the base templates, or create manually in Word/LibreOffice
2. Use `{{ variable_name }}` for all dynamic content
3. Use Times New Roman for body text, navy blue (`#2D4E78`) for headings
4. Include the Performance West header, confidentiality footer, and page numbers
5. Save as `.docx` (not `.doc`)
6. Upload to MinIO: `mc cp template.docx minio/performancewest/templates/`
### Modifying an Existing Template
1. Download from MinIO: `mc cp minio/performancewest/templates/name.docx .`
2. Edit in Word or LibreOffice — preserve all `{{ }}` placeholders
3. Test locally: `python -c "from scripts.document_gen.docx_builder import DocxBuilder; ..."`
4. Upload the updated template back to MinIO
5. Existing generated documents are not affected (they are separate files)
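
One way to guard step 2 (preserving placeholders) is to compare the variable names found in the template text before and after an edit. The sketch below is an assumption-laden illustration, not part of `DocxBuilder`: the helper names are hypothetical, it works on plain text already extracted from the DOCX, and it uses a simple regular expression rather than Jinja2's own parser, so it only sees `{{ ... }}` expressions and ignores names used solely inside `{% ... %}` tags.

```python
import re

# Captures the root name of a {{ ... }} expression:
# "{{ entity_name }}" -> "entity_name", "{{ member.name }}" -> "member".
_VARIABLE = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")


def placeholder_names(template_text: str) -> set[str]:
    """Return the root names used in {{ ... }} expressions (hypothetical helper)."""
    return set(_VARIABLE.findall(template_text))


def missing_placeholders(original_text: str, edited_text: str) -> set[str]:
    """Names present in the original template text but absent after editing."""
    return placeholder_names(original_text) - placeholder_names(edited_text)


original = "Operating Agreement of {{ entity_name }}, organized in {{ state_name }}."
edited = "Operating Agreement of {{ entity_name }}, organized in the state."
print(missing_placeholders(original, edited))
```

If the project already depends on Jinja2, `jinja2.meta.find_undeclared_variables` is likely the more robust way to list the variables a template expects. Note also that Word can split a placeholder across formatting runs, so the text should be extracted paragraph by paragraph before checking.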
## LLM-Based Generation
### When Used
- FLSA/wage & hour audit reports
- CCPA/CPRA compliance audit reports
- TCPA consent audit reports
- Independent contractor classification assessments
- Employee handbook reviews
- Data breach response plans
### How It Works
1. Worker fetches the DOCX template (provides structure and formatting)
2. Worker constructs a prompt from the service-specific handler + intake data
3. Worker sends the prompt to a locally running Ollama instance (`qwen2.5:7b` by default)
4. LLM returns analysis text for each section
5. `DocxBuilder.insert_section()` replaces section placeholders with LLM output
6. Simple variables (company name, dates) are filled via `DocxBuilder.fill()`
7. Document is converted to PDF and uploaded to MinIO
8. Status is always set to **Review** — LLM output must be human-reviewed
### Prompt Engineering Guidelines
Each compliance service has a dedicated handler in `scripts/workers/services/` that constructs the prompt. Follow these guidelines:
**Structure:**
```
You are a compliance consultant preparing a {document_type} for {company_name}.
CONTEXT:
{intake_data formatted as structured text}
INSTRUCTIONS:
- Write in a professional, objective tone
- Cite specific regulations by name and section number
- Identify concrete findings (compliant, non-compliant, needs improvement)
- Provide actionable remediation steps with deadlines
- Do not include legal advice disclaimers (the template adds these)
OUTPUT FORMAT:
Return a JSON object with the following keys:
- executive_summary: 2-3 paragraph overview
- {section_name}: detailed analysis for each section
- remediation_plan: prioritized action items
Write for a business audience. Be specific, not generic.
```
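
A handler might assemble this structure along the following lines. This is a minimal sketch, not the code in `scripts/workers/services/`: `build_prompt`, `format_intake`, and the sample company and intake values are hypothetical and exist only for illustration.

```python
PROMPT_TEMPLATE = """\
You are a compliance consultant preparing a {document_type} for {company_name}.
CONTEXT:
{context}
INSTRUCTIONS:
- Write in a professional, objective tone
- Cite specific regulations by name and section number
- Identify concrete findings (compliant, non-compliant, needs improvement)
- Provide actionable remediation steps with deadlines
- Do not include legal advice disclaimers (the template adds these)
OUTPUT FORMAT:
Return a JSON object with the following keys:
{output_keys}
Write for a business audience. Be specific, not generic.
"""


def format_intake(intake: dict[str, object]) -> str:
    """Render intake answers as labeled lines instead of a raw form dump."""
    return "\n".join(
        f"- {key.replace('_', ' ').capitalize()}: {value}"
        for key, value in intake.items()
    )


def build_prompt(
    document_type: str,
    company_name: str,
    intake: dict[str, object],
    sections: dict[str, str],
) -> str:
    """Fill the prompt skeleton; `sections` maps placeholder name -> description."""
    output_keys = "\n".join(f"- {name}: {desc}" for name, desc in sections.items())
    return PROMPT_TEMPLATE.format(
        document_type=document_type,
        company_name=company_name,
        context=format_intake(intake),
        output_keys=output_keys,
    )


# Hypothetical sample data; section names match the template placeholders.
prompt = build_prompt(
    "contractor classification assessment",
    "Example Telecom LLC",
    {"employee_count": 42, "states": "CA, TX"},
    {
        "executive_summary": "2-3 paragraph overview",
        "classification_analysis": "detailed analysis of worker classification",
        "remediation_plan": "prioritized action items",
    },
)
print(prompt)
```

Keeping the section names in one dictionary means the keys requested from the model and the placeholders in the template can be checked against the same source.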
**Key rules:**
- Always request JSON output — easier to parse and insert into template sections
- Include the intake data as structured context, not raw form dumps
- Specify the exact section names that match template placeholders
- Set temperature to 0.3 for consistency; compliance documents should not be creative
- Maximum token limit: 4096 per section to prevent rambling
- If the LLM returns malformed JSON, retry once with a stricter prompt
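
The parse-and-retry rule above could look roughly like this. It is a sketch under stated assumptions: `call_llm` is a hypothetical callable that takes a prompt and returns the raw model text, and the wording of the stricter retry suffix is invented for illustration.

```python
import json
from typing import Callable

# Hypothetical wording appended to the prompt on the single retry.
STRICT_SUFFIX = (
    "\nReturn ONLY a valid JSON object with exactly the requested keys. "
    "No prose and no code fences."
)


def parse_sections(raw: str, expected_keys: list[str]) -> dict[str, str]:
    """Parse the model output and confirm every expected section is present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError subclass)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return {key: str(data[key]) for key in expected_keys}


def generate_sections(
    call_llm: Callable[[str], str], prompt: str, expected_keys: list[str]
) -> dict[str, str]:
    """Try once; on malformed output, retry exactly once with a stricter prompt."""
    try:
        return parse_sections(call_llm(prompt), expected_keys)
    except ValueError:
        # A second failure propagates to the caller instead of looping.
        return parse_sections(call_llm(prompt + STRICT_SUFFIX), expected_keys)
```

Validating the keys as well as the JSON syntax catches the case where the model returns well-formed JSON that omits a section the template needs.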
**Model selection:**
- Default: `qwen2.5:7b` (good balance of quality and speed on a GPU with 16 GB of VRAM)
- For complex multi-state analysis: `qwen2.5:14b` if GPU memory allows
- Configured via `OLLAMA_MODEL` environment variable
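
For reference, a request that applies these settings through Ollama's `/api/generate` endpoint could be sketched as follows, using only the standard library. This is not the contents of `llm_writer.py`: the function names and the `OLLAMA_URL` variable are assumptions, and `http://localhost:11434` is simply Ollama's default local address.

```python
import json
import os
import urllib.request

# OLLAMA_MODEL comes from the section above; OLLAMA_URL is an assumed name.
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:7b")
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434")


def build_generate_payload(prompt: str, model: str = OLLAMA_MODEL) -> dict:
    """Build the JSON body for POST /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,   # return one JSON response instead of a token stream
        "format": "json",  # ask Ollama to constrain the output to valid JSON
        "options": {
            "temperature": 0.3,
            # num_predict caps a single response; applying the limit per
            # section would require one call per section (assumption).
            "num_predict": 4096,
        },
    }


def generate(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    body = json.dumps(build_generate_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Switching to `qwen2.5:14b` then only requires setting `OLLAMA_MODEL` in the worker's environment; the 300-second timeout is a placeholder value, not a documented setting.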
## PDF Conversion
DOCX to PDF conversion uses a two-tier approach:
### PRIMARY: Windows DocServer (Microsoft Word COM)
A Windows server runs a Flask-based DocServer at `:5050` that uses Microsoft Word via COM
automation for pixel-perfect DOCX → PDF conversion. This produces the highest-fidelity
output (exact font rendering, correct page breaks, proper table formatting).
```python
# pdf_converter.py — primary path
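import requests

# Upload the DOCX to DocServer; the context manager closes the file handle.
with open(docx_path, "rb") as docx_file:
    response = requests.post(
        f"http://{DOCSERVER_HOST}:5050/convert",
        files={"file": docx_file},
        timeout=60,
    )
# Raise on HTTP errors so the caller can fall back to LibreOffice.
response.raise_for_status()
pdf_bytes = response.content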
```
### FALLBACK: LibreOffice Headless
If DocServer is unavailable (network error, timeout, Windows server down), the converter
falls back to LibreOffice in headless mode:
```bash
libreoffice --headless --convert-to pdf --outdir /tmp document.docx
```
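
From Python, the same command can be run with `subprocess`. The sketch below is illustrative only: the function names and the 120-second timeout are assumptions, not values taken from `pdf_converter.py`.

```python
import subprocess
from pathlib import Path


def libreoffice_command(docx_path: str, outdir: str) -> list[str]:
    """Build the headless conversion command shown above."""
    return [
        "libreoffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", outdir,
        docx_path,
    ]


def convert_with_libreoffice(
    docx_path: str, outdir: str = "/tmp", timeout: int = 120
) -> Path:
    """Run LibreOffice and return the path of the PDF it wrote."""
    subprocess.run(
        libreoffice_command(docx_path, outdir),
        check=True,           # non-zero exit raises CalledProcessError
        timeout=timeout,      # a hung conversion raises TimeoutExpired
        capture_output=True,
    )
    # LibreOffice names the output after the input file's stem.
    pdf_path = Path(outdir) / (Path(docx_path).stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"LibreOffice produced no output for {docx_path}")
    return pdf_path
```

Checking that the PDF exists is worthwhile because LibreOffice can exit successfully without writing a file when the input cannot be opened.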
### Converter Logic
The `pdf_converter.py` module handles:
- **DocServer first** — POST to `:5050/convert`, 60-second timeout
- **Fallback to LibreOffice** — if DocServer returns error or times out
- Retry logic (up to 3 attempts per converter)
- Temporary file cleanup
- Error reporting to ERPNext
- Logs which converter was used for each document
LibreOffice is installed in the Python worker Docker container (`scripts/Dockerfile`).
DocServer host is configured via `DOCSERVER_HOST` environment variable (default: `192.168.1.x`).
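
The ordering and retry behavior described above can be sketched independently of either converter. In this illustration the converters are passed in as plain callables (hypothetical signature: DOCX path in, PDF bytes out), so the function name and logger name are assumptions rather than the module's actual API.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("pdf_converter")

Converter = Callable[[str], bytes]  # takes a DOCX path, returns PDF bytes


def convert_with_fallback(
    docx_path: str,
    converters: Sequence[tuple[str, Converter]],
    attempts: int = 3,
) -> tuple[str, bytes]:
    """Try each converter in order, up to `attempts` times each.

    Returns the name of the converter that succeeded and the PDF bytes.
    """
    errors: list[str] = []
    for name, convert in converters:
        for attempt in range(1, attempts + 1):
            try:
                pdf_bytes = convert(docx_path)
            except Exception as exc:  # network error, timeout, bad exit code
                logger.warning("%s attempt %d failed: %s", name, attempt, exc)
                errors.append(f"{name}#{attempt}: {exc}")
            else:
                logger.info("converted %s using %s", docx_path, name)
                return name, pdf_bytes
    raise RuntimeError("all converters failed: " + "; ".join(errors))
```

A caller would pass the DocServer converter first and the LibreOffice converter second. The final `RuntimeError` carries every failure message, which is the information that would be reported back to ERPNext.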
## MinIO Upload/Download
The `minio_client.py` module provides:
```python
from scripts.document_gen.minio_client import (
    download_template,
    presign_url,
    upload_document,
)

# Upload a generated document
upload_document(
    local_path="/tmp/operating-agreement.pdf",
    minio_path="orders/FO-2026-0001/operating-agreement.pdf",
    content_type="application/pdf",
)

# Download a template
download_template(
    template_name="operating-agreement",  # downloads operating-agreement.docx
    local_path="/tmp/operating-agreement.docx",
)

# Generate a pre-signed URL for customer download
url = presign_url(
    minio_path="orders/FO-2026-0001/operating-agreement.pdf",
    expires=3600,  # 1 hour
)
```
**Bucket structure:** See `docs/crm.md` for the full MinIO directory layout.
**Security:** MinIO is not exposed externally. The Express API generates time-limited pre-signed URLs for customer downloads.
## Quality Gates
### Admin Review
Every generated document enters **Review** status before delivery:
1. Admin opens the order in ERPNext
2. Downloads the DOCX/PDF from the attached MinIO link
3. Reviews for accuracy, completeness, and professionalism
4. Actions:
- **Approve** — moves to Ready
- **Request Revision** — moves to Revision with notes; worker re-generates
- **Reject** — flags for manual document creation
### Revision Loop
When a reviewer requests changes:
1. Order status returns to **Processing**
2. Reviewer's notes are stored in the ERPNext order comments
3. Worker re-generates with adjusted prompts or manual edits
4. Document re-enters **Review**
5. Maximum 3 automated revision cycles; after that, manual creation is required
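
The cycle limit in step 5 amounts to a small decision function. The sketch below assumes the worker can read how many automated revision cycles an order has already completed; the function name and the returned labels are hypothetical and are not ERPNext status values.

```python
MAX_AUTOMATED_REVISIONS = 3


def next_step_after_revision_request(completed_cycles: int) -> str:
    """Decide how to handle a new revision request.

    `completed_cycles` is the number of automated revision cycles already
    finished for this order. Returns "regenerate" while the limit has not
    been reached, otherwise "manual".
    """
    if completed_cycles < 0:
        raise ValueError("completed_cycles cannot be negative")
    if completed_cycles < MAX_AUTOMATED_REVISIONS:
        return "regenerate"
    return "manual"


for cycles in range(5):
    print(cycles, next_step_after_revision_request(cycles))
```

Under this reading, the first three revision requests trigger regeneration and the fourth is routed to manual creation.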
## File Reference
```
scripts/
├── document_gen/
│   ├── __init__.py
│   ├── docx_builder.py                 # DOCX template filling (Jinja2 + python-docx)
│   ├── llm_writer.py                   # Ollama prompt construction and parsing
│   ├── minio_client.py                 # MinIO upload/download/presign
│   └── pdf_converter.py                # DOCX→PDF (DocServer primary, LibreOffice fallback)
├── templates/
│   ├── create_templates.py             # Generates all .docx templates (run once)
│   ├── crtc-registration-letter.docx   # CRTC carrier registration letter template
│   ├── bc-corporate-binder.docx        # BC corporate binder (9 sections)
│   ├── vendor-directory.docx           # Canadian telecom vendor directory
│   └── *.docx                          # Other generated template files
└── workers/
    ├── base_worker.py                  # ERPNext polling loop, status transitions
    ├── erpnext_client.py               # ERPNext REST API client
    ├── delivery_worker.py              # Email delivery with SMTP
    ├── renewal_worker.py               # Subscription renewal reminders
    └── services/
        ├── base_handler.py             # Base class for service handlers
        ├── privacy_policy.py           # Template-based: fill and convert
        ├── breach_response.py          # LLM: breach response plan
        ├── flsa_audit.py               # LLM: FLSA audit report
        ├── ccpa_audit.py               # LLM: CCPA audit report
        ├── consent_audit.py            # LLM: TCPA consent audit
        ├── contractor_review.py        # LLM: contractor classification
        ├── handbook_review.py          # LLM: handbook review
        ├── campaign_review.py          # LLM: marketing campaign review
        └── dnc_review.py               # LLM: DNC compliance review
```