dnviti 52655e9eee
feat: Implement CLI tool, Celery workers, and VMware collector
Complete implementation of core MVP components:

CLI Tool (src/datacenter_docs/cli.py):
- 11 commands for system management (serve, worker, init-db, generate, etc.)
- Auto-remediation policy management (enable/disable/status)
- System statistics and monitoring
- Rich formatted output with tables and panels

Celery Workers (src/datacenter_docs/workers/):
- celery_app.py with 4 specialized queues (documentation, auto_remediation, data_collection, maintenance)
- tasks.py with 8 async tasks integrated with MongoDB/Beanie
- Celery Beat scheduling (6h docs, 1h data collection, 15m metrics, 2am cleanup)
- Rate limiting (10 auto-remediation/h) and timeout configuration
- Task lifecycle signals and comprehensive logging

VMware Collector (src/datacenter_docs/collectors/):
- BaseCollector abstract class with full workflow (connect/collect/validate/store/disconnect)
- VMwareCollector for vSphere infrastructure data collection
- Collects VMs, ESXi hosts, clusters, datastores, networks with statistics
- MCP client integration with mock data fallback for development
- MongoDB storage via AuditLog and data validation

Documentation & Configuration:
- Updated README.md with CLI commands and Workers sections
- Updated TODO.md with project status (55% completion)
- Added CLAUDE.md with comprehensive project instructions
- Added Docker compose setup for development environment

Project Status:
- Completion: 50% -> 55%
- MVP Milestone: 80% complete (only Infrastructure Generator remaining)
- Estimated time to MVP: 1-2 days

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 22:29:59 +02:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.


Project Overview

LLM Automation - Docs & Remediation Engine: AI-powered datacenter documentation generation with autonomous problem resolution capabilities. The system uses LLMs to automatically generate infrastructure documentation and can autonomously execute remediation actions on datacenter infrastructure.

Current Status: ~35% complete - Infrastructure and API are functional, but CLI tool, Celery workers, collectors, and generators are not yet implemented.

Language: Python 3.12 (standardized across entire project)

Database: MongoDB with Beanie ODM (async, document-based)


Essential Commands

Development Environment Setup

# Install dependencies
poetry install

# Start Docker development stack (6 services: MongoDB, Redis, API, Chat, Worker, Frontend)
cd deploy/docker
docker-compose -f docker-compose.dev.yml up --build -d

# Check service status
docker-compose -f docker-compose.dev.yml ps

# View logs
docker-compose -f docker-compose.dev.yml logs -f api
docker-compose -f docker-compose.dev.yml logs -f --tail=50 api

# Stop services
docker-compose -f docker-compose.dev.yml down

# Restart single service after code changes
docker-compose -f docker-compose.dev.yml restart api

Testing & Code Quality

# Run all tests
poetry run pytest

# Run specific test file
poetry run pytest tests/test_reliability.py

# Run with coverage
poetry run pytest --cov=src/datacenter_docs --cov-report=html

# Linting
poetry run black src/
poetry run ruff check src/
poetry run mypy src/

# Format code (100 char line length)
poetry run black src/ tests/

Running Services Locally

# API server (development with auto-reload)
poetry run uvicorn datacenter_docs.api.main:app --reload --host 0.0.0.0 --port 8000

# CLI tool (NOT YET IMPLEMENTED - needs src/datacenter_docs/cli.py)
poetry run datacenter-docs --help

# Celery worker (NOT YET IMPLEMENTED - needs src/datacenter_docs/workers/)
poetry run docs-worker

# Chat server (NOT YET IMPLEMENTED - needs src/datacenter_docs/chat/main.py)
poetry run docs-chat

Database Operations

# Access MongoDB shell in Docker
docker exec -it datacenter-docs-mongodb-dev mongosh -u admin -p admin123

# Access Redis CLI
docker exec -it datacenter-docs-redis-dev redis-cli

# Check database connectivity
curl http://localhost:8000/health

High-Level Architecture

1. LLM Provider System (OpenAI-Compatible API)

Location: src/datacenter_docs/utils/llm_client.py

Key Concept: All LLM interactions go through LLMClient which uses the OpenAI SDK and can connect to ANY OpenAI-compatible provider:

  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic Claude (via OpenAI-compatible endpoint)
  • LLMStudio (local models)
  • Open-WebUI (local models)
  • Ollama (local models)

Configuration (in .env):

LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key
LLM_MODEL=gpt-4-turbo-preview

Usage:

from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
json_response = await llm.generate_json(messages=[...])

2. Database Architecture (MongoDB + Beanie ODM)

Location: src/datacenter_docs/api/models.py

Key Characteristics:

  • Models inherit from beanie.Document
  • MongoDB atomic operations
  • Async operations: await Ticket.find_one(), await ticket.save()
  • ObjectId for primary keys: PydanticObjectId
  • Supports embedded documents and references

Example:

from beanie import Document, PydanticObjectId
from datetime import datetime
from pydantic import Field

class Ticket(Document):
    ticket_id: str
    status: TicketStatus
    # default_factory avoids a single timestamp shared by all documents
    created_at: datetime = Field(default_factory=datetime.now)

    class Settings:
        name = "tickets"  # Collection name
        indexes = ["ticket_id", "status"]

# Usage
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")
ticket.status = TicketStatus.RESOLVED
await ticket.save()

3. Auto-Remediation Decision Flow

Multi-layered safety system that decides whether AI can execute infrastructure changes.

Flow (src/datacenter_docs/api/reliability.py → api/auto_remediation.py):

Ticket Created
    ↓
ReliabilityCalculator.calculate_reliability()
    ├─ AI Confidence Score (25%)
    ├─ Human Feedback History (30%)
    ├─ Historical Success Rate (25%)
    └─ Pattern Matching (20%)
    ↓
Overall Reliability Score (0-100%)
    ↓
AutoRemediationDecisionEngine.should_execute()
    ├─ Check if enabled for ticket
    ├─ Check minimum reliability (85%)
    ├─ Check action risk level
    ├─ Check rate limits
    └─ Determine if approval needed
    ↓
AutoRemediationEngine.execute_remediation()
    ├─ Pre-execution checks
    ├─ Execute via MCP Client
    ├─ Post-execution validation
    └─ Log everything

Key Classes:

  • ReliabilityCalculator: Calculates weighted reliability score
  • AutoRemediationDecisionEngine: Decides if/how to execute
  • AutoRemediationEngine: Actually executes actions via MCP
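The weighted scoring step above can be sketched as follows. This is a minimal illustration of the four weights in the flow diagram, not the actual ReliabilityCalculator implementation in api/reliability.py:

```python
# Illustrative sketch of the weighted reliability score described above.
# The real ReliabilityCalculator in api/reliability.py may differ.

def calculate_reliability(
    ai_confidence: float,   # 0-100, weight 25%
    feedback_score: float,  # 0-100, weight 30% (human feedback history)
    success_rate: float,    # 0-100, weight 25% (historical success)
    pattern_match: float,   # 0-100, weight 20%
) -> float:
    """Combine the four signals into an overall 0-100 reliability score."""
    return (
        ai_confidence * 0.25
        + feedback_score * 0.30
        + success_rate * 0.25
        + pattern_match * 0.20
    )

# A ticket scoring 90/80/85/70 yields 81.75 -- below the 85% minimum,
# so the decision engine would require human approval.
score = calculate_reliability(90, 80, 85, 70)
```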

4. MCP Client Integration

Location: src/datacenter_docs/mcp/client.py

MCP (Model Context Protocol) is the bridge to infrastructure. It's an external service that connects to VMware, Kubernetes, network devices, etc.

Important: MCP Client is EXTERNAL. We don't implement the infrastructure connections - we call MCP's API.

Operations:

  • Read operations: Get VM status, list pods, check network config
  • Write operations (auto-remediation): Restart VM, scale deployment, enable port
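The read/write split can be illustrated with a stand-in client. The method and tool names below are assumptions for illustration only, not the actual MCPClient API in src/datacenter_docs/mcp/client.py:

```python
# Hypothetical usage sketch -- tool names and the call() signature are
# illustrative; the real client talks to the external MCP service.
import asyncio


class MCPClient:
    """Stand-in for the real MCP client (which calls an external service)."""

    async def call(self, tool: str, **params: object) -> dict:
        # The real client would send the request to the MCP service here.
        return {"tool": tool, "params": params, "status": "ok"}


async def main() -> dict:
    mcp = MCPClient()
    # Read operation: query infrastructure state
    vm_status = await mcp.call("vmware.get_vm_status", vm_name="web-01")
    # Write operation (auto-remediation): change infrastructure state
    await mcp.call("vmware.restart_vm", vm_name="web-01", dry_run=True)
    return vm_status

result = asyncio.run(main())
```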

5. Documentation Agent (Agentic AI)

Location: src/datacenter_docs/chat/agent.py

Architecture Pattern: RAG (Retrieval Augmented Generation)

User Query
    ↓
Vector Search (ChromaDB + HuggingFace embeddings)
    ↓
Retrieve Top-K Relevant Docs
    ↓
Build Context + Query → LLM
    ↓
Generate Response with Citations

Key Methods:

  • search_documentation(): Semantic search in vector store
  • resolve_ticket(): Analyze problem + suggest resolution
  • chat_with_context(): Conversational interface with doc search
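The RAG flow above can be sketched end to end. This is a deliberately simplified illustration: the real agent in chat/agent.py uses ChromaDB vector search and LLMClient, whereas here retrieval is a trivial keyword match and the final LLM call is omitted:

```python
# Simplified RAG sketch of the flow above -- real retrieval is a vector
# similarity search over ChromaDB embeddings, not keyword counting.

DOCS = {
    "network.md": "VLAN 100 carries management traffic between ESXi hosts.",
    "storage.md": "Datastore ds-prod-01 backs the production VM cluster.",
}


def search_documentation(query: str, top_k: int = 1) -> list[tuple[str, str]]:
    """Retrieve the top-k most relevant docs for the query."""
    scored = [
        (sum(w in text.lower() for w in query.lower().split()), name, text)
        for name, text in DOCS.items()
    ]
    scored.sort(reverse=True)
    return [(name, text) for _, name, text in scored[:top_k]]


def build_prompt(query: str) -> str:
    """Build the context + query prompt that would be sent to the LLM."""
    context = "\n".join(
        f"[{name}] {text}" for name, text in search_documentation(query)
    )
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations."

prompt = build_prompt("Which VLAN carries management traffic?")
```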

6. Missing Critical Components (TODO)

See TODO.md for comprehensive list. When implementing new features, check TODO.md first.

High Priority Missing Components:

  1. CLI Tool (src/datacenter_docs/cli.py):

    • Entry point: datacenter-docs command
    • Uses Typer + Rich for CLI
    • Commands: generate, serve, worker, init-db, stats
  2. Celery Workers (src/datacenter_docs/workers/):

    • celery_app.py: Celery configuration
    • tasks.py: Async tasks (documentation generation, auto-remediation execution)
    • Background task processing
  3. Collectors (src/datacenter_docs/collectors/):

    • Base class exists, implementations missing
    • Need: VMware, Kubernetes, Network, Storage collectors
    • Pattern: async def collect() -> dict
  4. Generators (src/datacenter_docs/generators/):

    • Base class exists, implementations missing
    • Need: Infrastructure, Network, Virtualization generators
    • Pattern: async def generate(data: dict) -> str (returns Markdown)

When implementing these:

  • Follow existing patterns in base classes
  • Use LLMClient for AI generation
  • Use MCPClient for infrastructure data collection
  • All operations are async
  • Use MongoDB/Beanie for storage
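A collector skeleton following those rules might look like this. It is illustrative only: the real BaseCollector in collectors/ defines the actual abstract interface, and the real store() writes an AuditLog document via Beanie:

```python
# Illustrative collector skeleton -- method names follow the workflow
# described above (connect/collect/validate/store/disconnect); the real
# BaseCollector abstract class may differ.
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class BaseCollector(ABC):
    async def run(self) -> dict[str, Any]:
        """Full workflow: connect, collect, validate, store, disconnect."""
        await self.connect()
        try:
            data = await self.collect()
            self.validate(data)
            await self.store(data)
            return data
        finally:
            await self.disconnect()

    async def connect(self) -> None: ...
    async def disconnect(self) -> None: ...

    async def store(self, data: dict[str, Any]) -> None:
        ...  # real version persists an AuditLog document via Beanie

    def validate(self, data: dict[str, Any]) -> None:
        if not data:
            raise ValueError("collector returned no data")

    @abstractmethod
    async def collect(self) -> dict[str, Any]: ...


class KubernetesCollector(BaseCollector):
    async def collect(self) -> dict[str, Any]:
        # Real version would query cluster state via the MCP client
        return {"pods": 42, "nodes": 3}

data = asyncio.run(KubernetesCollector().run())
```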

Code Patterns & Conventions

Async/Await

All operations use asyncio:

async def my_function():
    result = await some_async_call()

Type Hints

Type hints are required (mypy configured strictly):

async def process_ticket(ticket_id: str) -> Dict[str, Any]:
    ...

Logging

Use structured logging with module-level logger:

import logging

logger = logging.getLogger(__name__)

logger.info(f"Processing ticket {ticket_id}")
logger.error(f"Failed to execute action: {e}", exc_info=True)

Configuration

All config via src/datacenter_docs/utils/config.py using Pydantic Settings:

from datacenter_docs.utils.config import get_settings

settings = get_settings()
mongodb_url = settings.MONGODB_URL
llm_model = settings.LLM_MODEL

Error Handling

try:
    result = await risky_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}", exc_info=True)
    return {"success": False, "error": str(e)}

Docker Development Workflow

Primary development environment: Docker Compose

Services in deploy/docker/docker-compose.dev.yml:

  • mongodb: MongoDB 7 (port 27017)
  • redis: Redis 7 (port 6379)
  • api: FastAPI service (port 8000)
  • chat: WebSocket chat server (port 8001) - NOT IMPLEMENTED
  • worker: Celery worker - NOT IMPLEMENTED
  • frontend: React + Nginx (port 80) - MINIMAL

Development cycle:

  1. Edit code in src/
  2. Rebuild and restart affected service: docker-compose -f docker-compose.dev.yml up --build -d api
  3. Check logs: docker-compose -f docker-compose.dev.yml logs -f api
  4. Test: Access http://localhost:8000/api/docs

Volume mounts: Source code is mounted, so changes are reflected (except for dependency changes which need rebuild).


CI/CD Pipelines

Three CI/CD systems configured (all use Python 3.12):

  • .github/workflows/build-deploy.yml: GitHub Actions
  • .gitlab-ci.yml: GitLab CI
  • .gitea/workflows/ci.yml: Gitea Actions

Pipeline stages:

  1. Lint (Black, Ruff)
  2. Type check (mypy)
  3. Test (pytest)
  4. Build Docker image
  5. Deploy (if on main branch)

When modifying Python version: Update ALL three pipeline files.


Key Files Reference

Core Application:

  • src/datacenter_docs/api/main.py: FastAPI application entry point
  • src/datacenter_docs/api/models.py: MongoDB/Beanie models (all data structures)
  • src/datacenter_docs/utils/config.py: Configuration management
  • src/datacenter_docs/utils/llm_client.py: LLM provider abstraction

Auto-Remediation:

  • src/datacenter_docs/api/reliability.py: Reliability scoring and decision engine
  • src/datacenter_docs/api/auto_remediation.py: Execution engine with safety checks

Infrastructure Integration:

  • src/datacenter_docs/mcp/client.py: MCP protocol client
  • src/datacenter_docs/chat/agent.py: Documentation AI agent (RAG)

Configuration:

  • .env.example: Template with ALL config options (including LLM provider examples)
  • pyproject.toml: Dependencies, scripts, linting config (Black 100 char, Python 3.12)

Documentation:

  • README.md: User-facing documentation
  • TODO.md: CRITICAL - Current project status, missing components, roadmap
  • deploy/docker/README.md: Docker environment guide

Important Notes

Python Version

Use Python 3.12 (standardized across the project).

Database Queries

MongoDB queries look different from SQL:

# Find
tickets = await Ticket.find(Ticket.status == TicketStatus.PENDING).to_list()

# Find one
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")

# Update
ticket.status = TicketStatus.RESOLVED
await ticket.save()

# Complex query
tickets = await Ticket.find(
    Ticket.created_at > datetime.now() - timedelta(days=7),
    Ticket.category == "network"
).to_list()

LLM API Calls

Use the generic client:

from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])

Auto-Remediation Safety

When implementing new remediation actions:

  1. Define action in RemediationAction model
  2. Set appropriate ActionRiskLevel (low/medium/high/critical)
  3. Implement pre/post validation checks
  4. Add comprehensive logging
  5. Test with dry_run=True first
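Those five steps can be sketched together. The field and enum names below are assumptions for illustration; the actual RemediationAction and ActionRiskLevel models live in api/models.py:

```python
# Illustrative sketch of the five steps above -- field names are
# assumptions, not the project's actual Beanie models.
from dataclasses import dataclass
from enum import Enum


class ActionRiskLevel(str, Enum):  # step 2: set an appropriate risk level
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class RemediationAction:  # step 1: define the action
    name: str
    risk_level: ActionRiskLevel
    dry_run: bool = True  # step 5: always test with dry_run=True first


def execute(action: RemediationAction) -> dict:
    # Step 3: pre-execution check (illustrative policy)
    if action.risk_level in (ActionRiskLevel.HIGH, ActionRiskLevel.CRITICAL):
        return {"executed": False, "reason": "requires human approval"}
    # Step 4: log comprehensively, then execute (or simulate in dry-run)
    mode = "dry-run" if action.dry_run else "live"
    print(f"[{mode}] executing {action.name}")
    return {"executed": True, "mode": mode}

result = execute(RemediationAction("restart_vm", ActionRiskLevel.LOW))
```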

Testing

Tests are minimal currently. When adding tests:

  • Use pytest-asyncio for async tests
  • Mock MCP client and LLM client
  • Test reliability calculations thoroughly
  • Test safety checks in auto-remediation
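A mocked-client test might look like the sketch below. In the repo these would be pytest-asyncio tests; here asyncio.run() stands in for the @pytest.mark.asyncio runner, and resolve_ticket is a toy system-under-test, not the real agent method:

```python
# Illustrative async test with a mocked LLM client -- names are
# assumptions; the real clients live in utils/llm_client.py and
# mcp/client.py, and real tests would use pytest-asyncio.
import asyncio
from unittest.mock import AsyncMock


async def resolve_ticket(llm, ticket_id: str) -> dict:
    """Toy system-under-test: asks the (mocked) LLM for a resolution."""
    suggestion = await llm.chat_completion(
        messages=[{"role": "user", "content": f"Resolve {ticket_id}"}]
    )
    return {"ticket_id": ticket_id, "resolution": suggestion}


async def test_resolve_ticket_uses_mocked_llm() -> dict:
    llm = AsyncMock()
    llm.chat_completion.return_value = "Restart the service"
    result = await resolve_ticket(llm, "INC-123")
    llm.chat_completion.assert_awaited_once()
    return result

outcome = asyncio.run(test_resolve_ticket_uses_mocked_llm())
```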

When Implementing New Features

  1. Check TODO.md first - component might be partially implemented
  2. Follow existing patterns in similar components
  3. Use type hints (mypy is strict)
  4. Use LLMClient for AI operations
  5. Use Beanie ODM for database operations
  6. All operations are async (use async/await)
  7. Test in Docker (primary development environment)
  8. Update TODO.md when marking components as completed

Questions? Check These Files

  • "How do I configure the LLM provider?" → .env.example, utils/config.py, utils/llm_client.py
  • "How does auto-remediation work?" → api/reliability.py, api/auto_remediation.py
  • "What's not implemented yet?" → TODO.md (comprehensive list with estimates)
  • "How do I run tests/lint?" → pyproject.toml (all commands), this file
  • "Database schema?" → api/models.py (all Beanie models)
  • "Docker services?" → deploy/docker/docker-compose.dev.yml, deploy/docker/README.md
  • "API endpoints?" → api/main.py, or http://localhost:8000/api/docs when running

Last Updated: 2025-10-19
Project Status: 35% complete (Infrastructure done, business logic pending)
Next Priority: CLI tool → Celery workers → Collectors → Generators