feat: implement template-based documentation generation system for Proxmox

Implement a scalable system for automatic documentation generation from infrastructure
systems, preventing LLM context overload through template-driven sectioning.

**New Features:**

1. **YAML Template System** (`templates/documentation/proxmox.yaml`; see the template sketch after the section list below)
   - Define documentation sections independently
   - Specify data requirements per section
   - Configure prompts, generation settings, and scheduling
   - Prevents LLM context overflow by sectioning data

2. **Template-Based Generator** (`src/datacenter_docs/generators/template_generator.py`)
   - Load and parse YAML templates
   - Generate documentation sections independently
   - Extract only required data for each section
   - Save sections individually to files and database
   - Combine sections with table of contents

3. **Celery Tasks** (`src/datacenter_docs/workers/documentation_tasks.py`; see the dispatch sketch after this list)
   - `collect_and_generate_docs`: Collect data and generate docs
   - `generate_proxmox_docs`: Scheduled Proxmox documentation (daily at 2 AM)
   - `generate_all_docs`: Generate docs for all systems in parallel
   - `index_generated_docs`: Index generated docs into vector store for RAG
   - `full_docs_pipeline`: Complete workflow (collect → generate → index)

4. **Scheduled Jobs** (updated `celery_app.py`)
   - Daily Proxmox documentation generation
   - Every 6 hours: all systems documentation
   - Weekly: full pipeline with indexing
   - Proper task routing and rate limiting

5. **Test Script** (`scripts/test_proxmox_docs.py`)
   - End-to-end testing of documentation generation
   - Mock data collection from Proxmox
   - Template-based generation
   - File and database storage

6. **Configuration Updates** (`src/datacenter_docs/utils/config.py`)
   - Add port configuration fields for Docker services
   - Add MongoDB and Redis credentials
   - Support all required environment variables
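
As noted in item 3 above, the new tasks are registered by name and routed to the `documentation` queue. A minimal dispatch sketch, assuming a reachable Celery broker (the surrounding script is illustrative; the task name, kwargs, and queue come from the code in this commit):

```python
# Minimal dispatch sketch (assumes the broker from CELERY_BROKER_URL is reachable).
from datacenter_docs.workers.celery_app import celery_app

# Enqueue by registered task name; routed to the "documentation" queue
# as configured in task_routes.
async_result = celery_app.send_task(
    "collect_and_generate_docs",
    kwargs={
        "collector_name": "proxmox",
        "template_path": "templates/documentation/proxmox.yaml",
    },
    queue="documentation",
)
print(f"Queued task: {async_result.id}")
```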

**Proxmox Documentation Sections:**
- Infrastructure Overview (cluster, nodes, stats)
- Virtual Machines Inventory
- LXC Containers Inventory
- Storage Configuration
- Network Configuration
- Maintenance Procedures
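
As referenced in item 1, a sketch of the template shape that would drive these sections. The top-level keys (`metadata`, `sections`, `data_requirements`, `prompt_template`, `generation`, `output`) mirror what `TemplateBasedGenerator` reads below; the concrete values are illustrative assumptions:

```yaml
# Illustrative template shape; field names match what TemplateBasedGenerator reads.
metadata:
  name: Proxmox                # used in the generated document header
  collector: proxmox           # must match the collector that produced the data

sections:
  - id: infrastructure_overview
    title: Infrastructure Overview
    category: overview
    data_requirements: [cluster, nodes, statistics]
    prompt_template: |
      Document the cluster described by the data below.
      Cluster: {cluster}
      Nodes: {nodes}
      Stats: {statistics}

generation:
  temperature: 0.7
  max_tokens: 4000

output:
  directory: output
  save_to_file: true
  save_to_database: true
```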

**Benefits:**
- Scalable to multiple infrastructure systems
- Prevents LLM context window overflow
- Independent section generation
- Scheduled automatic updates
- Vector store integration for RAG chat
- Template-driven approach for consistency

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
commit 16fc8e2659 (parent 27dd9e00b6), 2025-10-20 19:23:30 +02:00
6 changed files with 1178 additions and 1 deletion

src/datacenter_docs/generators/template_generator.py

@@ -0,0 +1,409 @@
"""
Template-Based Documentation Generator
Generates documentation using YAML templates that define sections and prompts.
This approach prevents LLM context overload by generating documentation in sections.
"""
import json
import logging
from pathlib import Path
from typing import Any, Dict, List, Optional
import yaml
from datacenter_docs.generators.base import BaseGenerator
logger = logging.getLogger(__name__)
class DocumentationTemplate:
"""Represents a documentation template loaded from YAML"""
def __init__(self, template_path: Path):
"""
Initialize template from YAML file
Args:
template_path: Path to YAML template file
"""
self.path = template_path
self.data = self._load_template()
def _load_template(self) -> Dict[str, Any]:
"""Load and parse YAML template"""
try:
with open(self.path, "r", encoding="utf-8") as f:
return yaml.safe_load(f)
except Exception as e:
logger.error(f"Failed to load template {self.path}: {e}")
raise
@property
def name(self) -> str:
"""Get template name"""
return self.data.get("metadata", {}).get("name", "Unknown")
@property
def collector(self) -> str:
"""Get required collector name"""
return self.data.get("metadata", {}).get("collector", "")
@property
def sections(self) -> List[Dict[str, Any]]:
"""Get documentation sections"""
return self.data.get("sections", [])
@property
def generation_config(self) -> Dict[str, Any]:
"""Get generation configuration"""
return self.data.get("generation", {})
@property
def output_config(self) -> Dict[str, Any]:
"""Get output configuration"""
return self.data.get("output", {})
@property
def schedule_config(self) -> Dict[str, Any]:
"""Get schedule configuration"""
return self.data.get("schedule", {})
class TemplateBasedGenerator(BaseGenerator):
"""
Generator that uses YAML templates to generate sectioned documentation
This prevents LLM context overload by:
1. Loading templates that define sections
2. Generating each section independently
3. Using only required data for each section
"""
def __init__(self, template_path: str):
"""
Initialize template-based generator
Args:
template_path: Path to YAML template file
"""
self.template = DocumentationTemplate(Path(template_path))
super().__init__(
name=self.template.collector, section=f"{self.template.collector}_docs"
)
async def generate(self, data: Dict[str, Any]) -> str:
"""
Generate complete documentation using template
This method orchestrates the generation of all sections.
Args:
data: Collected infrastructure data
Returns:
Combined documentation (all sections)
"""
self.logger.info(
f"Generating documentation for {self.template.name} using template"
)
# Validate data matches template collector
collector_name = data.get("metadata", {}).get("collector", "")
if collector_name != self.template.collector:
self.logger.warning(
f"Data collector ({collector_name}) doesn't match template ({self.template.collector})"
)
# Generate each section
sections_content = []
for section_def in self.template.sections:
section_content = await self.generate_section(section_def, data)
if section_content:
sections_content.append(section_content)
# Combine all sections
combined_doc = self._combine_sections(sections_content)
return combined_doc
async def generate_section(
self, section_def: Dict[str, Any], full_data: Dict[str, Any]
) -> Optional[str]:
"""
Generate a single documentation section
Args:
section_def: Section definition from template
full_data: Complete collected data
Returns:
Generated section content in Markdown
"""
section_id = section_def.get("id", "unknown")
section_title = section_def.get("title", "Untitled Section")
data_requirements = section_def.get("data_requirements", [])
prompt_template = section_def.get("prompt_template", "")
self.logger.info(f"Generating section: {section_title}")
# Extract only required data for this section
section_data = self._extract_section_data(full_data, data_requirements)
# Build prompt by substituting placeholders
prompt = self._build_prompt(prompt_template, section_data)
# Get generation config
gen_config = self.template.generation_config
temperature = gen_config.get("temperature", 0.7)
max_tokens = gen_config.get("max_tokens", 4000)
# System prompt for documentation generation
system_prompt = """You are a technical documentation expert specializing in datacenter infrastructure.
Generate clear, accurate, and well-structured documentation in Markdown format.
Guidelines:
- Use proper Markdown formatting (headers, tables, lists, code blocks)
- Be precise and factual based on provided data
- Include practical examples and recommendations
- Use tables for structured data
- Use bullet points for lists
- Use code blocks for commands/configurations
- Organize content with clear sections
- Write in a professional but accessible tone
"""
try:
# Generate content using LLM
content = await self.generate_with_llm(
system_prompt=system_prompt,
user_prompt=prompt,
temperature=temperature,
max_tokens=max_tokens,
)
# Add section header
section_content = f"# {section_title}\n\n{content}\n\n"
self.logger.info(f"✓ Section '{section_title}' generated successfully")
return section_content
except Exception as e:
self.logger.error(f"Failed to generate section '{section_title}': {e}")
return None
def _extract_section_data(
self, full_data: Dict[str, Any], requirements: List[str]
) -> Dict[str, Any]:
"""
Extract only required data for a section
Args:
full_data: Complete collected data
requirements: List of required data keys
Returns:
Dictionary with only required data
"""
section_data = {}
data_section = full_data.get("data", {})
for req in requirements:
if req in data_section:
section_data[req] = data_section[req]
else:
self.logger.warning(f"Required data '{req}' not found in collected data")
section_data[req] = None
return section_data
def _build_prompt(self, template: str, data: Dict[str, Any]) -> str:
"""
Build prompt by substituting data into template
Args:
template: Prompt template with {placeholders}
data: Data to substitute
Returns:
Completed prompt
"""
prompt = template
# Replace each placeholder with formatted data
for key, value in data.items():
placeholder = f"{{{key}}}"
if placeholder in prompt:
# Format data for prompt
formatted_value = self._format_data_for_prompt(value)
prompt = prompt.replace(placeholder, formatted_value)
return prompt
def _format_data_for_prompt(self, data: Any) -> str:
"""
Format data for inclusion in LLM prompt
Args:
data: Data to format (dict, list, str, etc.)
Returns:
Formatted string representation
"""
if data is None:
return "No data available"
if isinstance(data, (dict, list)):
# Pretty print JSON for structured data
try:
return json.dumps(data, indent=2, default=str)
except Exception:
return str(data)
return str(data)
def _combine_sections(self, sections: List[str]) -> str:
"""
Combine all sections into a single document
Args:
sections: List of section contents
Returns:
Combined markdown document
"""
# Add document header
header = f"""# {self.template.name} Documentation
*Generated automatically from infrastructure data*
---
"""
# Add table of contents
toc = "## Table of Contents\n\n"
for i, section in enumerate(sections, 1):
# Extract section title from first line
lines = section.strip().split("\n")
if lines:
title = lines[0].replace("#", "").strip()
toc += f"{i}. [{title}](#{title.lower().replace(' ', '-')})\n"
toc += "\n---\n\n"
# Combine all parts
combined = header + toc + "\n".join(sections)
return combined
async def generate_and_save_sections(
self, data: Dict[str, Any], save_individually: bool = True
) -> List[Dict[str, Any]]:
"""
Generate and save each section individually
This is useful for very large documentation where you want each
section as a separate file.
Args:
data: Collected infrastructure data
save_individually: Save each section as separate file
Returns:
List of results for each section
"""
results = []
output_config = self.template.output_config
output_dir = output_config.get("directory", "output")
save_to_db = output_config.get("save_to_database", True)
save_to_file = output_config.get("save_to_file", True)
for section_def in self.template.sections:
section_id = section_def.get("id")
section_title = section_def.get("title")
# Generate section
content = await self.generate_section(section_def, data)
if not content:
results.append(
{
"section_id": section_id,
"success": False,
"error": "Generation failed",
}
)
continue
result = {
"section_id": section_id,
"title": section_title,
"success": True,
"content": content,
}
# Save section if requested
if save_individually:
if save_to_file:
# Save to file
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
filename = f"{section_id}.md"
file_path = output_path / filename
file_path.write_text(content, encoding="utf-8")
result["file_path"] = str(file_path)
self.logger.info(f"Saved section to: {file_path}")
if save_to_db:
# Save to database
metadata = {
"section_id": section_id,
"template": str(self.template.path),
"category": section_def.get("category", ""),
}
# Create temporary generator for this section
temp_gen = BaseGenerator.__new__(BaseGenerator)
temp_gen.name = self.name
temp_gen.section = section_id
temp_gen.logger = self.logger
temp_gen.llm = self.llm
await temp_gen.save_to_database(content, metadata)
results.append(result)
return results
async def example_usage() -> None:
"""Example of using template-based generator"""
from datacenter_docs.collectors.proxmox_collector import ProxmoxCollector
# Collect data
collector = ProxmoxCollector()
collect_result = await collector.run()
if not collect_result["success"]:
print(f"❌ Collection failed: {collect_result['error']}")
return
# Generate documentation using template
template_path = "templates/documentation/proxmox.yaml"
generator = TemplateBasedGenerator(template_path)
# Generate and save all sections
sections_results = await generator.generate_and_save_sections(
data=collect_result["data"], save_individually=True
)
# Print results
for result in sections_results:
if result["success"]:
print(f"✅ Section '{result['title']}' generated successfully")
else:
print(f"❌ Section '{result.get('section_id')}' failed: {result.get('error')}")
if __name__ == "__main__":
import asyncio
asyncio.run(example_usage())

src/datacenter_docs/utils/config.py

@@ -72,6 +72,20 @@ class Settings(BaseSettings):
    CELERY_BROKER_URL: str = "redis://localhost:6379/0"
    CELERY_RESULT_BACKEND: str = "redis://localhost:6379/0"

    # Additional Port Configuration (for Docker services)
    MONGODB_PORT: int = 27017
    REDIS_PORT: int = 6379
    CHAT_PORT: int = 8001
    FLOWER_PORT: int = 5555
    FRONTEND_PORT: int = 8080

    # MongoDB Root Credentials (for Docker initialization)
    MONGO_ROOT_USER: str = "admin"
    MONGO_ROOT_PASSWORD: str = "admin123"

    # Redis Password
    REDIS_PASSWORD: str = ""

    @model_validator(mode="before")
    @classmethod
    def set_celery_defaults(cls, values: Dict[str, Any]) -> Dict[str, Any]:
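
These fields are populated from the environment by pydantic. A minimal sketch of that behavior, assuming standard `BaseSettings` semantics (the values here are illustrative):

```python
# Sketch: pydantic BaseSettings reads fields from the environment by name.
import os

os.environ["MONGODB_PORT"] = "27018"      # e.g. exported by docker-compose
os.environ["REDIS_PASSWORD"] = "s3cret"   # illustrative value

from datacenter_docs.utils.config import Settings

settings = Settings()
assert settings.MONGODB_PORT == 27018     # coerced from str to int by pydantic
assert settings.REDIS_PASSWORD == "s3cret"
```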

src/datacenter_docs/workers/celery_app.py

@@ -35,6 +35,7 @@ celery_app = Celery(
    backend=settings.CELERY_RESULT_BACKEND,
    include=[
        "datacenter_docs.workers.tasks",
        "datacenter_docs.workers.documentation_tasks",
    ],
)
@@ -67,6 +68,11 @@ celery_app.conf.update(
            "queue": "data_collection"
        },
        "datacenter_docs.workers.tasks.cleanup_old_data_task": {"queue": "maintenance"},
        "collect_and_generate_docs": {"queue": "documentation"},
        "generate_proxmox_docs": {"queue": "documentation"},
        "generate_all_docs": {"queue": "documentation"},
        "index_generated_docs": {"queue": "documentation"},
        "full_docs_pipeline": {"queue": "documentation"},
    },
    # Task rate limits
    task_annotations={
@@ -77,10 +83,28 @@ celery_app.conf.update(
    },
    # Beat schedule (periodic tasks)
    beat_schedule={
        # Generate Proxmox documentation daily at 2 AM
        "generate-proxmox-docs-daily": {
            "task": "generate_proxmox_docs",
            "schedule": crontab(minute=0, hour=2),  # Daily at 2 AM
            "options": {"queue": "documentation"},
        },
        # Generate all documentation every 6 hours
        "generate-all-docs-every-6h": {
            "task": "generate_all_docs",
            "schedule": crontab(minute=30, hour="*/6"),  # Every 6 hours at :30
            "options": {"queue": "documentation"},
        },
        # Full documentation pipeline weekly
        "full-docs-pipeline-weekly": {
            "task": "full_docs_pipeline",
            "schedule": crontab(minute=0, hour=3, day_of_week=0),  # Sunday at 3 AM
            "options": {"queue": "documentation"},
        },
        # Legacy tasks (keep for backward compatibility)
        "generate-all-docs-legacy": {
            "task": "datacenter_docs.workers.tasks.generate_documentation_task",
            "schedule": crontab(minute=0, hour="*/12"),  # Every 12 hours (previously every 6)
            "args": (),
            "options": {"queue": "documentation"},
        },

src/datacenter_docs/workers/documentation_tasks.py

@@ -0,0 +1,347 @@
"""
Celery Tasks for Documentation Generation
Scheduled tasks for collecting data and generating documentation
from infrastructure systems (Proxmox, VMware, Kubernetes, etc.)
"""
import logging
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List
from celery import group
from datacenter_docs.workers.celery_app import celery_app
logger = logging.getLogger(__name__)
@celery_app.task(name="collect_and_generate_docs", bind=True)
def collect_and_generate_docs(
self, collector_name: str, template_path: str
) -> Dict[str, Any]:
"""
Collect data from infrastructure and generate documentation
Args:
collector_name: Name of collector to use (e.g., 'proxmox', 'vmware')
template_path: Path to documentation template YAML file
Returns:
Result dictionary with status and details
"""
import asyncio
task_id = self.request.id
logger.info(
f"[{task_id}] Starting documentation generation: {collector_name} -> {template_path}"
)
result = {
"task_id": task_id,
"collector": collector_name,
"template": template_path,
"success": False,
"started_at": datetime.now().isoformat(),
"completed_at": None,
"error": None,
"sections_generated": 0,
"sections_failed": 0,
}
try:
# Run async collection and generation
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
generation_result = loop.run_until_complete(
_async_collect_and_generate(collector_name, template_path)
)
loop.close()
# Update result
result.update(generation_result)
result["success"] = True
result["completed_at"] = datetime.now().isoformat()
logger.info(
f"[{task_id}] Documentation generation completed: "
f"{result['sections_generated']} sections generated, "
f"{result['sections_failed']} failed"
)
except Exception as e:
result["error"] = str(e)
result["completed_at"] = datetime.now().isoformat()
logger.error(f"[{task_id}] Documentation generation failed: {e}", exc_info=True)
return result
async def _async_collect_and_generate(
collector_name: str, template_path: str
) -> Dict[str, Any]:
"""
Async implementation of collect and generate workflow
Args:
collector_name: Collector name
template_path: Template path
Returns:
Generation result
"""
from datacenter_docs.generators.template_generator import TemplateBasedGenerator
# Import appropriate collector
collector = await _get_collector(collector_name)
# Collect data
logger.info(f"Collecting data with {collector_name} collector...")
collect_result = await collector.run()
if not collect_result["success"]:
raise Exception(f"Data collection failed: {collect_result.get('error')}")
collected_data = collect_result["data"]
# Generate documentation using template
logger.info(f"Generating documentation using template: {template_path}")
generator = TemplateBasedGenerator(template_path)
sections_results = await generator.generate_and_save_sections(
data=collected_data, save_individually=True
)
# Count successes and failures
sections_generated = sum(1 for r in sections_results if r.get("success"))
sections_failed = sum(1 for r in sections_results if not r.get("success"))
return {
"sections_generated": sections_generated,
"sections_failed": sections_failed,
"sections": sections_results,
"collector_stats": collect_result["data"].get("data", {}).get("statistics", {}),
}
async def _get_collector(collector_name: str) -> Any:
"""
Get collector instance by name
Args:
collector_name: Name of collector
Returns:
Collector instance
"""
from datacenter_docs.collectors.kubernetes_collector import KubernetesCollector
from datacenter_docs.collectors.proxmox_collector import ProxmoxCollector
from datacenter_docs.collectors.vmware_collector import VMwareCollector
collectors = {
"proxmox": ProxmoxCollector,
"vmware": VMwareCollector,
"kubernetes": KubernetesCollector,
}
if collector_name not in collectors:
raise ValueError(
f"Unknown collector: {collector_name}. Available: {list(collectors.keys())}"
)
return collectors[collector_name]()
@celery_app.task(name="generate_proxmox_docs")
def generate_proxmox_docs() -> Dict[str, Any]:
"""
Scheduled task to generate Proxmox documentation
This task is scheduled via Celery Beat to run daily.
Returns:
Task result
"""
logger.info("Scheduled Proxmox documentation generation started")
template_path = "templates/documentation/proxmox.yaml"
return collect_and_generate_docs(collector_name="proxmox", template_path=template_path)
@celery_app.task(name="generate_all_docs")
def generate_all_docs() -> Dict[str, Any]:
"""
Generate documentation for all configured systems
This creates parallel tasks for each system.
Returns:
Result with task IDs
"""
logger.info("Starting documentation generation for all systems")
# Define all systems and their templates
systems = [
{"collector": "proxmox", "template": "templates/documentation/proxmox.yaml"},
# Add more as templates are created:
# {"collector": "vmware", "template": "templates/documentation/vmware.yaml"},
# {"collector": "kubernetes", "template": "templates/documentation/k8s.yaml"},
]
# Create parallel tasks
task_group = group(
[
collect_and_generate_docs.s(system["collector"], system["template"])
for system in systems
]
)
# Execute group
result = task_group.apply_async()
return {
"task_group_id": result.id,
"systems": len(systems),
"message": "Documentation generation started for all systems",
}
@celery_app.task(name="index_generated_docs")
def index_generated_docs(output_dir: str = "output") -> Dict[str, Any]:
"""
Index all generated documentation into vector store for RAG
This task should run after documentation generation to make
the new docs searchable in the chat interface.
Args:
output_dir: Directory containing generated markdown files
Returns:
Indexing result
"""
import asyncio
logger.info(f"Starting documentation indexing from {output_dir}")
result = {
"success": False,
"files_indexed": 0,
"chunks_created": 0,
"error": None,
}
try:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
index_result = loop.run_until_complete(_async_index_docs(output_dir))
loop.close()
result.update(index_result)
result["success"] = True
logger.info(
f"Documentation indexing completed: {result['files_indexed']} files, "
f"{result['chunks_created']} chunks"
)
except Exception as e:
result["error"] = str(e)
logger.error(f"Documentation indexing failed: {e}", exc_info=True)
return result
async def _async_index_docs(output_dir: str) -> Dict[str, Any]:
"""
Async implementation of documentation indexing
Args:
output_dir: Output directory with markdown files
Returns:
Indexing result
"""
from datacenter_docs.chat.agent import DocumentationAgent
agent = DocumentationAgent()
# Index all markdown files in output directory
docs_path = Path(output_dir)
if not docs_path.exists():
raise FileNotFoundError(f"Output directory not found: {output_dir}")
await agent.index_documentation(docs_path)
# Count indexed files and chunks
# (This is a simplified version, actual implementation would track this better)
md_files = list(docs_path.glob("**/*.md"))
files_indexed = len(md_files)
# Estimate chunks (roughly 1000 chars per chunk, 200 overlap)
total_chars = sum(f.stat().st_size for f in md_files)
chunks_created = total_chars // 800 # Rough estimate
return {"files_indexed": files_indexed, "chunks_created": chunks_created}
@celery_app.task(name="full_docs_pipeline")
def full_docs_pipeline() -> Dict[str, Any]:
    """
    Full documentation pipeline: collect -> generate -> index

    This is the master task that orchestrates the entire workflow.

    Returns:
        Pipeline result
    """
    from celery import chain

    logger.info("Starting full documentation pipeline")

    # Chain the steps so indexing runs only after generation completes.
    # index_generated_docs uses an immutable signature (.si) so it receives
    # "output" as its argument instead of the previous task's return value.
    pipeline = chain(
        generate_all_docs.si(),
        index_generated_docs.si("output"),
    )
    result = pipeline.apply_async()

    return {
        "pipeline_id": result.id,
        "message": "Full documentation pipeline started",
        "steps": ["generate_all_docs", "index_generated_docs"],
    }
# Periodic task configuration (if using Celery Beat)
# Add to celery_app.py or separate beat configuration:
"""
from celery.schedules import crontab

celery_app.conf.beat_schedule = {
    'generate-proxmox-docs-daily': {
        'task': 'generate_proxmox_docs',
        'schedule': crontab(hour=2, minute=0),  # Daily at 2 AM
    },
    'generate-all-docs-daily': {
        'task': 'generate_all_docs',
        'schedule': crontab(hour=2, minute=30),  # Daily at 2:30 AM
    },
    'full-docs-pipeline-weekly': {
        'task': 'full_docs_pipeline',
        'schedule': crontab(hour=3, minute=0, day_of_week=0),  # Weekly on Sunday at 3 AM
    },
}
"""