dnviti 52655e9eee
feat: Implement CLI tool, Celery workers, and VMware collector
Complete implementation of core MVP components:

CLI Tool (src/datacenter_docs/cli.py):
- 11 commands for system management (serve, worker, init-db, generate, etc.)
- Auto-remediation policy management (enable/disable/status)
- System statistics and monitoring
- Rich formatted output with tables and panels

Celery Workers (src/datacenter_docs/workers/):
- celery_app.py with 4 specialized queues (documentation, auto_remediation, data_collection, maintenance)
- tasks.py with 8 async tasks integrated with MongoDB/Beanie
- Celery Beat scheduling (6h docs, 1h data collection, 15m metrics, 2am cleanup)
- Rate limiting (10 auto-remediation/h) and timeout configuration
- Task lifecycle signals and comprehensive logging

VMware Collector (src/datacenter_docs/collectors/):
- BaseCollector abstract class with full workflow (connect/collect/validate/store/disconnect)
- VMwareCollector for vSphere infrastructure data collection
- Collects VMs, ESXi hosts, clusters, datastores, networks with statistics
- MCP client integration with mock data fallback for development
- MongoDB storage via AuditLog and data validation

Documentation & Configuration:
- Updated README.md with CLI commands and Workers sections
- Updated TODO.md with project status (55% completion)
- Added CLAUDE.md with comprehensive project instructions
- Added Docker compose setup for development environment

Project Status:
- Completion: 50% -> 55%
- MVP Milestone: 80% complete (only Infrastructure Generator remaining)
- Estimated time to MVP: 1-2 days

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 22:29:59 +02:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.


Project Overview

LLM Automation - Docs & Remediation Engine: AI-powered datacenter documentation generation with autonomous problem resolution capabilities. The system uses LLMs to automatically generate infrastructure documentation and can autonomously execute remediation actions on datacenter infrastructure.

Current Status: ~35% complete - Infrastructure and API are functional, but CLI tool, Celery workers, collectors, and generators are not yet implemented.

Language: Python 3.12 (standardized across entire project)

Database: MongoDB with Beanie ODM (async, document-based)


Essential Commands

Development Environment Setup

# Install dependencies
poetry install

# Start Docker development stack (6 services: MongoDB, Redis, API, Chat, Worker, Frontend)
cd deploy/docker
docker-compose -f docker-compose.dev.yml up --build -d

# Check service status
docker-compose -f docker-compose.dev.yml ps

# View logs
docker-compose -f docker-compose.dev.yml logs -f api
docker-compose -f docker-compose.dev.yml logs -f --tail=50 api

# Stop services
docker-compose -f docker-compose.dev.yml down

# Restart single service after code changes
docker-compose -f docker-compose.dev.yml restart api

Testing & Code Quality

# Run all tests
poetry run pytest

# Run specific test file
poetry run pytest tests/test_reliability.py

# Run with coverage
poetry run pytest --cov=src/datacenter_docs --cov-report=html

# Linting
poetry run black src/
poetry run ruff check src/
poetry run mypy src/

# Format code (100 char line length)
poetry run black src/ tests/

Running Services Locally

# API server (development with auto-reload)
poetry run uvicorn datacenter_docs.api.main:app --reload --host 0.0.0.0 --port 8000

# CLI tool (NOT YET IMPLEMENTED - needs src/datacenter_docs/cli.py)
poetry run datacenter-docs --help

# Celery worker (NOT YET IMPLEMENTED - needs src/datacenter_docs/workers/)
poetry run docs-worker

# Chat server (NOT YET IMPLEMENTED - needs src/datacenter_docs/chat/main.py)
poetry run docs-chat

Database Operations

# Access MongoDB shell in Docker
docker exec -it datacenter-docs-mongodb-dev mongosh -u admin -p admin123

# Access Redis CLI
docker exec -it datacenter-docs-redis-dev redis-cli

# Check database connectivity
curl http://localhost:8000/health

High-Level Architecture

1. LLM Provider System (OpenAI-Compatible API)

Location: src/datacenter_docs/utils/llm_client.py

Key Concept: All LLM interactions go through LLMClient which uses the OpenAI SDK and can connect to ANY OpenAI-compatible provider:

  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic Claude (via OpenAI-compatible endpoint)
  • LLMStudio (local models)
  • Open-WebUI (local models)
  • Ollama (local models)

Configuration (in .env):

LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key
LLM_MODEL=gpt-4-turbo-preview

Usage:

from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
json_response = await llm.generate_json(messages=[...])

2. Database Architecture (MongoDB + Beanie ODM)

Location: src/datacenter_docs/api/models.py

Key Characteristics:

  • Models inherit from beanie.Document
  • MongoDB atomic operations
  • Async operations: await Ticket.find_one(), await ticket.save()
  • ObjectId for primary keys: PydanticObjectId
  • Supports embedded documents and references

Example:

from beanie import Document, PydanticObjectId
from datetime import datetime
from pydantic import Field

class Ticket(Document):
    ticket_id: str
    status: TicketStatus
    # default_factory avoids a single timestamp shared by all documents
    created_at: datetime = Field(default_factory=datetime.now)

    class Settings:
        name = "tickets"  # Collection name
        indexes = ["ticket_id", "status"]

# Usage
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")
ticket.status = TicketStatus.RESOLVED
await ticket.save()

3. Auto-Remediation Decision Flow

Multi-layered safety system that decides whether AI can execute infrastructure changes.

Flow (src/datacenter_docs/api/reliability.py → api/auto_remediation.py):

Ticket Created
    ↓
ReliabilityCalculator.calculate_reliability()
    ├─ AI Confidence Score (25%)
    ├─ Human Feedback History (30%)
    ├─ Historical Success Rate (25%)
    └─ Pattern Matching (20%)
    ↓
Overall Reliability Score (0-100%)
    ↓
AutoRemediationDecisionEngine.should_execute()
    ├─ Check if enabled for ticket
    ├─ Check minimum reliability (85%)
    ├─ Check action risk level
    ├─ Check rate limits
    └─ Determine if approval needed
    ↓
AutoRemediationEngine.execute_remediation()
    ├─ Pre-execution checks
    ├─ Execute via MCP Client
    ├─ Post-execution validation
    └─ Log everything

Key Classes:

  • ReliabilityCalculator: Calculates weighted reliability score
  • AutoRemediationDecisionEngine: Decides if/how to execute
  • AutoRemediationEngine: Actually executes actions via MCP
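The weighted scoring step above can be sketched as follows. This is a minimal illustration of the four weights in the flow diagram, not the actual ReliabilityCalculator implementation in api/reliability.py:

```python
# Illustrative sketch of the weighted reliability score described above.
# The real ReliabilityCalculator in api/reliability.py may differ.

def calculate_reliability(
    ai_confidence: float,   # 0-100, weight 25%
    feedback_score: float,  # 0-100, weight 30% (human feedback history)
    success_rate: float,    # 0-100, weight 25% (historical success)
    pattern_match: float,   # 0-100, weight 20%
) -> float:
    """Combine the four signals into an overall 0-100 reliability score."""
    return (
        ai_confidence * 0.25
        + feedback_score * 0.30
        + success_rate * 0.25
        + pattern_match * 0.20
    )

# A ticket scoring 90/80/85/70 yields 81.75 -- below the 85% minimum,
# so the decision engine would require human approval.
score = calculate_reliability(90, 80, 85, 70)
```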

4. MCP Client Integration

Location: src/datacenter_docs/mcp/client.py

MCP (Model Context Protocol) is the bridge to infrastructure. It's an external service that connects to VMware, Kubernetes, network devices, etc.

Important: MCP Client is EXTERNAL. We don't implement the infrastructure connections - we call MCP's API.

Operations:

  • Read operations: Get VM status, list pods, check network config
  • Write operations (auto-remediation): Restart VM, scale deployment, enable port
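The read/write split can be illustrated with a stand-in client. The method and tool names below are assumptions for illustration only, not the actual MCPClient API in src/datacenter_docs/mcp/client.py:

```python
# Hypothetical usage sketch -- tool names and the call() signature are
# illustrative; the real client talks to the external MCP service.
import asyncio


class MCPClient:
    """Stand-in for the real MCP client (which calls an external service)."""

    async def call(self, tool: str, **params: object) -> dict:
        # The real client would send the request to the MCP service here.
        return {"tool": tool, "params": params, "status": "ok"}


async def main() -> dict:
    mcp = MCPClient()
    # Read operation: query infrastructure state
    vm_status = await mcp.call("vmware.get_vm_status", vm_name="web-01")
    # Write operation (auto-remediation): change infrastructure state
    await mcp.call("vmware.restart_vm", vm_name="web-01", dry_run=True)
    return vm_status

result = asyncio.run(main())
```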

5. Documentation Agent (Agentic AI)

Location: src/datacenter_docs/chat/agent.py

Architecture Pattern: RAG (Retrieval Augmented Generation)

User Query
    ↓
Vector Search (ChromaDB + HuggingFace embeddings)
    ↓
Retrieve Top-K Relevant Docs
    ↓
Build Context + Query → LLM
    ↓
Generate Response with Citations

Key Methods:

  • search_documentation(): Semantic search in vector store
  • resolve_ticket(): Analyze problem + suggest resolution
  • chat_with_context(): Conversational interface with doc search
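The RAG flow above can be sketched end to end. This is a deliberately simplified illustration: the real agent in chat/agent.py uses ChromaDB vector search and LLMClient, whereas here retrieval is a trivial keyword match and the final LLM call is omitted:

```python
# Simplified RAG sketch of the flow above -- real retrieval is a vector
# similarity search over ChromaDB embeddings, not keyword counting.

DOCS = {
    "network.md": "VLAN 100 carries management traffic between ESXi hosts.",
    "storage.md": "Datastore ds-prod-01 backs the production VM cluster.",
}


def search_documentation(query: str, top_k: int = 1) -> list[tuple[str, str]]:
    """Retrieve the top-k most relevant docs for the query."""
    scored = [
        (sum(w in text.lower() for w in query.lower().split()), name, text)
        for name, text in DOCS.items()
    ]
    scored.sort(reverse=True)
    return [(name, text) for _, name, text in scored[:top_k]]


def build_prompt(query: str) -> str:
    """Build the context + query prompt that would be sent to the LLM."""
    context = "\n".join(
        f"[{name}] {text}" for name, text in search_documentation(query)
    )
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations."

prompt = build_prompt("Which VLAN carries management traffic?")
```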

6. Missing Critical Components (TODO)

See TODO.md for comprehensive list. When implementing new features, check TODO.md first.

High Priority Missing Components:

  1. CLI Tool (src/datacenter_docs/cli.py):

    • Entry point: datacenter-docs command
    • Uses Typer + Rich for CLI
    • Commands: generate, serve, worker, init-db, stats
  2. Celery Workers (src/datacenter_docs/workers/):

    • celery_app.py: Celery configuration
    • tasks.py: Async tasks (documentation generation, auto-remediation execution)
    • Background task processing
  3. Collectors (src/datacenter_docs/collectors/):

    • Base class exists, implementations missing
    • Need: VMware, Kubernetes, Network, Storage collectors
    • Pattern: async def collect() -> dict
  4. Generators (src/datacenter_docs/generators/):

    • Base class exists, implementations missing
    • Need: Infrastructure, Network, Virtualization generators
    • Pattern: async def generate(data: dict) -> str (returns Markdown)

When implementing these:

  • Follow existing patterns in base classes
  • Use LLMClient for AI generation
  • Use MCPClient for infrastructure data collection
  • All operations are async
  • Use MongoDB/Beanie for storage
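A collector skeleton following those rules might look like this. It is illustrative only: the real BaseCollector in collectors/ defines the actual abstract interface, and the real store() writes an AuditLog document via Beanie:

```python
# Illustrative collector skeleton -- method names follow the workflow
# described above (connect/collect/validate/store/disconnect); the real
# BaseCollector abstract class may differ.
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class BaseCollector(ABC):
    async def run(self) -> dict[str, Any]:
        """Full workflow: connect, collect, validate, store, disconnect."""
        await self.connect()
        try:
            data = await self.collect()
            self.validate(data)
            await self.store(data)
            return data
        finally:
            await self.disconnect()

    async def connect(self) -> None: ...
    async def disconnect(self) -> None: ...

    async def store(self, data: dict[str, Any]) -> None:
        ...  # real version persists an AuditLog document via Beanie

    def validate(self, data: dict[str, Any]) -> None:
        if not data:
            raise ValueError("collector returned no data")

    @abstractmethod
    async def collect(self) -> dict[str, Any]: ...


class KubernetesCollector(BaseCollector):
    async def collect(self) -> dict[str, Any]:
        # Real version would query cluster state via the MCP client
        return {"pods": 42, "nodes": 3}

data = asyncio.run(KubernetesCollector().run())
```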

Code Patterns & Conventions

Async/Await

All operations use asyncio:

async def my_function():
    result = await some_async_call()

Type Hints

Type hints are required (mypy configured strictly):

async def process_ticket(ticket_id: str) -> Dict[str, Any]:
    ...

Logging

Use structured logging with module-level logger:

import logging

logger = logging.getLogger(__name__)

logger.info(f"Processing ticket {ticket_id}")
logger.error(f"Failed to execute action: {e}", exc_info=True)

Configuration

All config via src/datacenter_docs/utils/config.py using Pydantic Settings:

from datacenter_docs.utils.config import get_settings

settings = get_settings()
mongodb_url = settings.MONGODB_URL
llm_model = settings.LLM_MODEL

Error Handling

try:
    result = await risky_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}", exc_info=True)
    return {"success": False, "error": str(e)}

Docker Development Workflow

Primary development environment: Docker Compose

Services in deploy/docker/docker-compose.dev.yml:

  • mongodb: MongoDB 7 (port 27017)
  • redis: Redis 7 (port 6379)
  • api: FastAPI service (port 8000)
  • chat: WebSocket chat server (port 8001) - NOT IMPLEMENTED
  • worker: Celery worker - NOT IMPLEMENTED
  • frontend: React + Nginx (port 80) - MINIMAL

Development cycle:

  1. Edit code in src/
  2. Rebuild and restart affected service: docker-compose -f docker-compose.dev.yml up --build -d api
  3. Check logs: docker-compose -f docker-compose.dev.yml logs -f api
  4. Test: Access http://localhost:8000/api/docs

Volume mounts: Source code is mounted, so changes are reflected (except for dependency changes which need rebuild).


CI/CD Pipelines

Three CI/CD systems configured (all use Python 3.12):

  • .github/workflows/build-deploy.yml: GitHub Actions
  • .gitlab-ci.yml: GitLab CI
  • .gitea/workflows/ci.yml: Gitea Actions

Pipeline stages:

  1. Lint (Black, Ruff)
  2. Type check (mypy)
  3. Test (pytest)
  4. Build Docker image
  5. Deploy (if on main branch)

When modifying Python version: Update ALL three pipeline files.


Key Files Reference

Core Application:

  • src/datacenter_docs/api/main.py: FastAPI application entry point
  • src/datacenter_docs/api/models.py: MongoDB/Beanie models (all data structures)
  • src/datacenter_docs/utils/config.py: Configuration management
  • src/datacenter_docs/utils/llm_client.py: LLM provider abstraction

Auto-Remediation:

  • src/datacenter_docs/api/reliability.py: Reliability scoring and decision engine
  • src/datacenter_docs/api/auto_remediation.py: Execution engine with safety checks

Infrastructure Integration:

  • src/datacenter_docs/mcp/client.py: MCP protocol client
  • src/datacenter_docs/chat/agent.py: Documentation AI agent (RAG)

Configuration:

  • .env.example: Template with ALL config options (including LLM provider examples)
  • pyproject.toml: Dependencies, scripts, linting config (Black 100 char, Python 3.12)

Documentation:

  • README.md: User-facing documentation
  • TODO.md: CRITICAL - Current project status, missing components, roadmap
  • deploy/docker/README.md: Docker environment guide

Important Notes

Python Version

Use Python 3.12 (standardized across the project).

Database Queries

MongoDB queries look different from SQL:

# Find
tickets = await Ticket.find(Ticket.status == TicketStatus.PENDING).to_list()

# Find one
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")

# Update
ticket.status = TicketStatus.RESOLVED
await ticket.save()

# Complex query
tickets = await Ticket.find(
    Ticket.created_at > datetime.now() - timedelta(days=7),
    Ticket.category == "network"
).to_list()

LLM API Calls

Use the generic client:

from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])

Auto-Remediation Safety

When implementing new remediation actions:

  1. Define action in RemediationAction model
  2. Set appropriate ActionRiskLevel (low/medium/high/critical)
  3. Implement pre/post validation checks
  4. Add comprehensive logging
  5. Test with dry_run=True first
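Those five steps can be sketched together. The field and enum names below are assumptions for illustration; the actual RemediationAction and ActionRiskLevel models live in api/models.py:

```python
# Illustrative sketch of the five steps above -- field names are
# assumptions, not the project's actual Beanie models.
from dataclasses import dataclass
from enum import Enum


class ActionRiskLevel(str, Enum):  # step 2: set an appropriate risk level
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class RemediationAction:  # step 1: define the action
    name: str
    risk_level: ActionRiskLevel
    dry_run: bool = True  # step 5: always test with dry_run=True first


def execute(action: RemediationAction) -> dict:
    # Step 3: pre-execution check (illustrative policy)
    if action.risk_level in (ActionRiskLevel.HIGH, ActionRiskLevel.CRITICAL):
        return {"executed": False, "reason": "requires human approval"}
    # Step 4: log comprehensively, then execute (or simulate in dry-run)
    mode = "dry-run" if action.dry_run else "live"
    print(f"[{mode}] executing {action.name}")
    return {"executed": True, "mode": mode}

result = execute(RemediationAction("restart_vm", ActionRiskLevel.LOW))
```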

Testing

Tests are minimal currently. When adding tests:

  • Use pytest-asyncio for async tests
  • Mock MCP client and LLM client
  • Test reliability calculations thoroughly
  • Test safety checks in auto-remediation
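A mocked-client test might look like the sketch below. In the repo these would be pytest-asyncio tests; here asyncio.run() stands in for the @pytest.mark.asyncio runner, and resolve_ticket is a toy system-under-test, not the real agent method:

```python
# Illustrative async test with a mocked LLM client -- names are
# assumptions; the real clients live in utils/llm_client.py and
# mcp/client.py, and real tests would use pytest-asyncio.
import asyncio
from unittest.mock import AsyncMock


async def resolve_ticket(llm, ticket_id: str) -> dict:
    """Toy system-under-test: asks the (mocked) LLM for a resolution."""
    suggestion = await llm.chat_completion(
        messages=[{"role": "user", "content": f"Resolve {ticket_id}"}]
    )
    return {"ticket_id": ticket_id, "resolution": suggestion}


async def test_resolve_ticket_uses_mocked_llm() -> dict:
    llm = AsyncMock()
    llm.chat_completion.return_value = "Restart the service"
    result = await resolve_ticket(llm, "INC-123")
    llm.chat_completion.assert_awaited_once()
    return result

outcome = asyncio.run(test_resolve_ticket_uses_mocked_llm())
```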

When Implementing New Features

  1. Check TODO.md first - component might be partially implemented
  2. Follow existing patterns in similar components
  3. Use type hints (mypy is strict)
  4. Use LLMClient for AI operations
  5. Use Beanie ODM for database operations
  6. All operations are async (use async/await)
  7. Test in Docker (primary development environment)
  8. Update TODO.md when marking components as completed

Questions? Check These Files

  • "How do I configure the LLM provider?" → .env.example, utils/config.py, utils/llm_client.py
  • "How does auto-remediation work?" → api/reliability.py, api/auto_remediation.py
  • "What's not implemented yet?" → TODO.md (comprehensive list with estimates)
  • "How do I run tests/lint?" → pyproject.toml (all commands), this file
  • "Database schema?" → api/models.py (all Beanie models)
  • "Docker services?" → deploy/docker/docker-compose.dev.yml, deploy/docker/README.md
  • "API endpoints?" → api/main.py, or http://localhost:8000/api/docs when running

Last Updated: 2025-10-19
Project Status: 35% complete (Infrastructure done, business logic pending)
Next Priority: CLI tool → Celery workers → Collectors → Generators