feat: implement template-based documentation generation system for Proxmox

Implement a scalable system for automatic documentation generation from infrastructure
systems, preventing LLM context overload through template-driven sectioning.

**New Features:**

1. **YAML Template System** (`templates/documentation/proxmox.yaml`)
   - Define documentation sections independently
   - Specify data requirements per section
   - Configure prompts, generation settings, and scheduling
   - Prevents LLM context overflow by sectioning data (see the template sketch after this list)

2. **Template-Based Generator** (`src/datacenter_docs/generators/template_generator.py`)
   - Load and parse YAML templates
   - Generate documentation sections independently
   - Extract only required data for each section
   - Save sections individually to files and database
   - Combine sections with table of contents

3. **Celery Tasks** (`src/datacenter_docs/workers/documentation_tasks.py`)
   - `collect_and_generate_docs`: Collect data and generate docs
   - `generate_proxmox_docs`: Scheduled Proxmox documentation (daily at 2 AM)
   - `generate_all_docs`: Generate docs for all systems in parallel
   - `index_generated_docs`: Index generated docs into vector store for RAG
   - `full_docs_pipeline`: Complete workflow (collect → generate → index)

4. **Scheduled Jobs** (updated `celery_app.py`; see the beat-schedule sketch after this list)
   - Daily Proxmox documentation generation
   - Every 6 hours: all systems documentation
   - Weekly: full pipeline with indexing
   - Proper task routing and rate limiting

5. **Test Script** (`scripts/test_proxmox_docs.py`)
   - End-to-end testing of documentation generation
   - Mock data collection from Proxmox
   - Template-based generation
   - File and database storage

6. **Configuration Updates** (`src/datacenter_docs/utils/config.py`)
   - Add port configuration fields for Docker services
   - Add MongoDB and Redis credentials
   - Support all required environment variables
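
A minimal sketch of the template shape implied by the generator in this commit: the top-level keys (`metadata`, `sections`, `generation`, `output`) and the per-section keys (`id`, `title`, `category`, `data_requirements`, `prompt_template`) are the ones `TemplateBasedGenerator` reads, while the example section, its data requirements, and the prompt wording are illustrative rather than copied from the committed `proxmox.yaml`:

```yaml
metadata:
  name: Proxmox VE
  collector: proxmox

generation:
  temperature: 0.7
  max_tokens: 4000

output:
  directory: output/proxmox
  save_to_file: true
  save_to_database: true

sections:
  - id: vm_inventory
    title: Virtual Machines Inventory
    category: inventory
    data_requirements:
      - vms
    prompt_template: |
      Create a Markdown inventory of the following virtual machines:
      {vms}
```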
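
A hedged sketch of the beat schedule described in item 4, assuming standard Celery `beat_schedule` configuration; the app object, the fully-qualified task paths, and the exact weekly time slot are assumptions, not a copy of the committed `celery_app.py`:

```python
from celery import Celery
from celery.schedules import crontab

# Stand-in for the project's existing Celery app defined in celery_app.py.
app = Celery("datacenter_docs")

app.conf.beat_schedule = {
    "generate-proxmox-docs-daily": {
        "task": "datacenter_docs.workers.documentation_tasks.generate_proxmox_docs",
        "schedule": crontab(hour=2, minute=0),  # daily at 2 AM
    },
    "generate-all-docs-every-6-hours": {
        "task": "datacenter_docs.workers.documentation_tasks.generate_all_docs",
        "schedule": crontab(minute=0, hour="*/6"),  # every 6 hours
    },
    "full-docs-pipeline-weekly": {
        "task": "datacenter_docs.workers.documentation_tasks.full_docs_pipeline",
        "schedule": crontab(hour=3, minute=0, day_of_week="sunday"),  # weekly; time assumed
    },
}
```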

**Proxmox Documentation Sections:**
- Infrastructure Overview (cluster, nodes, stats)
- Virtual Machines Inventory
- LXC Containers Inventory
- Storage Configuration
- Network Configuration
- Maintenance Procedures

**Benefits:**
- Scalable to multiple infrastructure systems
- Prevents LLM context window overflow
- Independent section generation
- Scheduled automatic updates
- Vector store integration for RAG chat
- Template-driven approach for consistency

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-20 19:23:30 +02:00
parent 27dd9e00b6
commit 16fc8e2659
6 changed files with 1178 additions and 1 deletions

src/datacenter_docs/generators/template_generator.py

@@ -0,0 +1,409 @@
"""
Template-Based Documentation Generator
Generates documentation using YAML templates that define sections and prompts.
This approach prevents LLM context overload by generating documentation in sections.
"""
import json
import logging
from pathlib import Path
from typing import Any, Dict, List, Optional
import yaml
from datacenter_docs.generators.base import BaseGenerator
logger = logging.getLogger(__name__)

class DocumentationTemplate:
    """Represents a documentation template loaded from YAML"""

    def __init__(self, template_path: Path):
        """
        Initialize template from YAML file

        Args:
            template_path: Path to YAML template file
        """
        self.path = template_path
        self.data = self._load_template()

    def _load_template(self) -> Dict[str, Any]:
        """Load and parse YAML template"""
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                return yaml.safe_load(f)
        except Exception as e:
            logger.error(f"Failed to load template {self.path}: {e}")
            raise

    @property
    def name(self) -> str:
        """Get template name"""
        return self.data.get("metadata", {}).get("name", "Unknown")

    @property
    def collector(self) -> str:
        """Get required collector name"""
        return self.data.get("metadata", {}).get("collector", "")

    @property
    def sections(self) -> List[Dict[str, Any]]:
        """Get documentation sections"""
        return self.data.get("sections", [])

    @property
    def generation_config(self) -> Dict[str, Any]:
        """Get generation configuration"""
        return self.data.get("generation", {})

    @property
    def output_config(self) -> Dict[str, Any]:
        """Get output configuration"""
        return self.data.get("output", {})

    @property
    def schedule_config(self) -> Dict[str, Any]:
        """Get schedule configuration"""
        return self.data.get("schedule", {})

class TemplateBasedGenerator(BaseGenerator):
    """
    Generator that uses YAML templates to generate sectioned documentation

    This prevents LLM context overload by:
    1. Loading templates that define sections
    2. Generating each section independently
    3. Using only required data for each section
    """

    def __init__(self, template_path: str):
        """
        Initialize template-based generator

        Args:
            template_path: Path to YAML template file
        """
        self.template = DocumentationTemplate(Path(template_path))

        super().__init__(
            name=self.template.collector, section=f"{self.template.collector}_docs"
        )

    async def generate(self, data: Dict[str, Any]) -> str:
        """
        Generate complete documentation using template

        This method orchestrates the generation of all sections.

        Args:
            data: Collected infrastructure data

        Returns:
            Combined documentation (all sections)
        """
        self.logger.info(
            f"Generating documentation for {self.template.name} using template"
        )

        # Validate data matches template collector
        collector_name = data.get("metadata", {}).get("collector", "")
        if collector_name != self.template.collector:
            self.logger.warning(
                f"Data collector ({collector_name}) doesn't match template ({self.template.collector})"
            )

        # Generate each section
        sections_content = []
        for section_def in self.template.sections:
            section_content = await self.generate_section(section_def, data)
            if section_content:
                sections_content.append(section_content)

        # Combine all sections
        combined_doc = self._combine_sections(sections_content)

        return combined_doc
    async def generate_section(
        self, section_def: Dict[str, Any], full_data: Dict[str, Any]
    ) -> Optional[str]:
        """
        Generate a single documentation section

        Args:
            section_def: Section definition from template
            full_data: Complete collected data

        Returns:
            Generated section content in Markdown
        """
        section_id = section_def.get("id", "unknown")
        section_title = section_def.get("title", "Untitled Section")
        data_requirements = section_def.get("data_requirements", [])
        prompt_template = section_def.get("prompt_template", "")

        self.logger.info(f"Generating section: {section_title}")

        # Extract only required data for this section
        section_data = self._extract_section_data(full_data, data_requirements)

        # Build prompt by substituting placeholders
        prompt = self._build_prompt(prompt_template, section_data)

        # Get generation config
        gen_config = self.template.generation_config
        temperature = gen_config.get("temperature", 0.7)
        max_tokens = gen_config.get("max_tokens", 4000)

        # System prompt for documentation generation
        system_prompt = """You are a technical documentation expert specializing in datacenter infrastructure.
Generate clear, accurate, and well-structured documentation in Markdown format.

Guidelines:
- Use proper Markdown formatting (headers, tables, lists, code blocks)
- Be precise and factual based on provided data
- Include practical examples and recommendations
- Use tables for structured data
- Use bullet points for lists
- Use code blocks for commands/configurations
- Organize content with clear sections
- Write in a professional but accessible tone
"""

        try:
            # Generate content using LLM
            content = await self.generate_with_llm(
                system_prompt=system_prompt,
                user_prompt=prompt,
                temperature=temperature,
                max_tokens=max_tokens,
            )

            # Add section header
            section_content = f"# {section_title}\n\n{content}\n\n"

            self.logger.info(f"✓ Section '{section_title}' generated successfully")

            return section_content

        except Exception as e:
            self.logger.error(f"Failed to generate section '{section_title}': {e}")
            return None
    def _extract_section_data(
        self, full_data: Dict[str, Any], requirements: List[str]
    ) -> Dict[str, Any]:
        """
        Extract only required data for a section

        Args:
            full_data: Complete collected data
            requirements: List of required data keys

        Returns:
            Dictionary with only required data
        """
        section_data = {}
        data_section = full_data.get("data", {})

        for req in requirements:
            if req in data_section:
                section_data[req] = data_section[req]
            else:
                self.logger.warning(f"Required data '{req}' not found in collected data")
                section_data[req] = None

        return section_data
    def _build_prompt(self, template: str, data: Dict[str, Any]) -> str:
        """
        Build prompt by substituting data into template

        Args:
            template: Prompt template with {placeholders}
            data: Data to substitute

        Returns:
            Completed prompt
        """
        prompt = template

        # Replace each placeholder with formatted data
        for key, value in data.items():
            placeholder = f"{{{key}}}"
            if placeholder in prompt:
                # Format data for prompt
                formatted_value = self._format_data_for_prompt(value)
                prompt = prompt.replace(placeholder, formatted_value)

        return prompt
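
    # Illustrative example (not taken from the shipped proxmox.yaml): for a section
    # defined with data_requirements: ["vms"] and
    # prompt_template: "Document these virtual machines:\n{vms}",
    # _build_prompt() replaces "{vms}" with the JSON rendering of the VM data produced
    # by _format_data_for_prompt(), so each prompt carries only that section's data.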

    def _format_data_for_prompt(self, data: Any) -> str:
        """
        Format data for inclusion in LLM prompt

        Args:
            data: Data to format (dict, list, str, etc.)

        Returns:
            Formatted string representation
        """
        if data is None:
            return "No data available"

        if isinstance(data, (dict, list)):
            # Pretty print JSON for structured data
            try:
                return json.dumps(data, indent=2, default=str)
            except Exception:
                return str(data)

        return str(data)
    def _combine_sections(self, sections: List[str]) -> str:
        """
        Combine all sections into a single document

        Args:
            sections: List of section contents

        Returns:
            Combined markdown document
        """
        # Add document header
        header = f"""# {self.template.name} Documentation

*Generated automatically from infrastructure data*

---

"""

        # Add table of contents
        toc = "## Table of Contents\n\n"
        for i, section in enumerate(sections, 1):
            # Extract section title from first line
            lines = section.strip().split("\n")
            if lines:
                title = lines[0].replace("#", "").strip()
                toc += f"{i}. [{title}](#{title.lower().replace(' ', '-')})\n"

        toc += "\n---\n\n"

        # Combine all parts
        combined = header + toc + "\n".join(sections)

        return combined
    async def generate_and_save_sections(
        self, data: Dict[str, Any], save_individually: bool = True
    ) -> List[Dict[str, Any]]:
        """
        Generate and save each section individually

        This is useful for very large documentation where you want each
        section as a separate file.

        Args:
            data: Collected infrastructure data
            save_individually: Save each section as separate file

        Returns:
            List of results for each section
        """
        results = []

        output_config = self.template.output_config
        output_dir = output_config.get("directory", "output")
        save_to_db = output_config.get("save_to_database", True)
        save_to_file = output_config.get("save_to_file", True)

        for section_def in self.template.sections:
            section_id = section_def.get("id")
            section_title = section_def.get("title")

            # Generate section
            content = await self.generate_section(section_def, data)

            if not content:
                results.append(
                    {
                        "section_id": section_id,
                        "success": False,
                        "error": "Generation failed",
                    }
                )
                continue

            result = {
                "section_id": section_id,
                "title": section_title,
                "success": True,
                "content": content,
            }

            # Save section if requested
            if save_individually:
                if save_to_file:
                    # Save to file
                    output_path = Path(output_dir)
                    output_path.mkdir(parents=True, exist_ok=True)

                    filename = f"{section_id}.md"
                    file_path = output_path / filename
                    file_path.write_text(content, encoding="utf-8")

                    result["file_path"] = str(file_path)
                    self.logger.info(f"Saved section to: {file_path}")

                if save_to_db:
                    # Save to database
                    metadata = {
                        "section_id": section_id,
                        "template": str(self.template.path),
                        "category": section_def.get("category", ""),
                    }

                    # Create temporary generator for this section
                    temp_gen = BaseGenerator.__new__(BaseGenerator)
                    temp_gen.name = self.name
                    temp_gen.section = section_id
                    temp_gen.logger = self.logger
                    temp_gen.llm = self.llm
                    await temp_gen.save_to_database(content, metadata)

            results.append(result)

        return results

async def example_usage() -> None:
    """Example of using template-based generator"""
    from datacenter_docs.collectors.proxmox_collector import ProxmoxCollector

    # Collect data
    collector = ProxmoxCollector()
    collect_result = await collector.run()

    if not collect_result["success"]:
        print(f"❌ Collection failed: {collect_result['error']}")
        return

    # Generate documentation using template
    template_path = "templates/documentation/proxmox.yaml"
    generator = TemplateBasedGenerator(template_path)

    # Generate and save all sections
    sections_results = await generator.generate_and_save_sections(
        data=collect_result["data"], save_individually=True
    )

    # Print results
    for result in sections_results:
        if result["success"]:
            print(f"✅ Section '{result['title']}' generated successfully")
        else:
            print(f"❌ Section '{result.get('section_id')}' failed: {result.get('error')}")


if __name__ == "__main__":
    import asyncio

    asyncio.run(example_usage())