# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

---

## Project Overview

**LLM Automation - Docs & Remediation Engine**: AI-powered datacenter documentation generation with autonomous problem resolution capabilities. The system uses LLMs to automatically generate infrastructure documentation and can autonomously execute remediation actions on datacenter infrastructure.

**Current Status**: ~35% complete. Infrastructure and API are functional, but the CLI tool, Celery workers, collectors, and generators are not yet implemented.

**Language**: Python 3.12 (standardized across the entire project)
**Database**: MongoDB with Beanie ODM (async, document-based)

---

## Essential Commands

### Development Environment Setup

```bash
# Install dependencies
poetry install

# Start the Docker development stack (6 services: MongoDB, Redis, API, Chat, Worker, Frontend)
cd deploy/docker
docker-compose -f docker-compose.dev.yml up --build -d

# Check service status
docker-compose -f docker-compose.dev.yml ps

# View logs
docker-compose -f docker-compose.dev.yml logs -f api
docker-compose -f docker-compose.dev.yml logs -f --tail=50 api

# Stop services
docker-compose -f docker-compose.dev.yml down

# Restart a single service after code changes
docker-compose -f docker-compose.dev.yml restart api
```

### Testing & Code Quality

```bash
# Run all tests
poetry run pytest

# Run a specific test file
poetry run pytest tests/test_reliability.py

# Run with coverage
poetry run pytest --cov=src/datacenter_docs --cov-report=html

# Linting and type checking
poetry run black src/
poetry run ruff check src/
poetry run mypy src/

# Format code (100 char line length)
poetry run black src/ tests/
```

### Running Services Locally

```bash
# API server (development with auto-reload)
poetry run uvicorn datacenter_docs.api.main:app --reload --host 0.0.0.0 --port 8000

# CLI tool (NOT YET IMPLEMENTED - needs src/datacenter_docs/cli.py)
poetry run datacenter-docs --help

# Celery worker (NOT YET IMPLEMENTED - needs src/datacenter_docs/workers/)
poetry run docs-worker

# Chat server (NOT YET IMPLEMENTED - needs src/datacenter_docs/chat/main.py)
poetry run docs-chat
```

### Database Operations

```bash
# Access MongoDB shell in Docker
docker exec -it datacenter-docs-mongodb-dev mongosh -u admin -p admin123

# Access Redis CLI
docker exec -it datacenter-docs-redis-dev redis-cli

# Check database connectivity
curl http://localhost:8000/health
```

---

## High-Level Architecture

### 1. **LLM Provider System (OpenAI-Compatible API)**

**Location**: `src/datacenter_docs/utils/llm_client.py`

**Key Concept**: All LLM interactions go through `LLMClient`, which uses the OpenAI SDK and can connect to ANY OpenAI-compatible provider:

- OpenAI (GPT-4, GPT-3.5)
- Anthropic Claude (via OpenAI-compatible endpoint)
- LLMStudio (local models)
- Open-WebUI (local models)
- Ollama (local models)

**Configuration** (in `.env`):

```bash
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key
LLM_MODEL=gpt-4-turbo-preview
```

**Usage**:

```python
from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
json_response = await llm.generate_json(messages=[...])
```

### 2. **Database Architecture (MongoDB + Beanie ODM)**

**Location**: `src/datacenter_docs/api/models.py`

**Key Characteristics**:

- Models inherit from `beanie.Document`
- MongoDB atomic operations
- Async operations: `await Ticket.find_one()`, `await ticket.save()`
- ObjectId for primary keys: `PydanticObjectId`
- Supports embedded documents and references

**Example**:

```python
from datetime import datetime

from beanie import Document
from pydantic import Field


class Ticket(Document):
    ticket_id: str
    status: TicketStatus
    # Use default_factory so the timestamp is evaluated per document,
    # not once at class-definition time
    created_at: datetime = Field(default_factory=datetime.now)

    class Settings:
        name = "tickets"  # Collection name
        indexes = ["ticket_id", "status"]


# Usage
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")
ticket.status = TicketStatus.RESOLVED
await ticket.save()
```

### 3. **Auto-Remediation Decision Flow**

**Multi-layered safety system** that decides whether AI can execute infrastructure changes.

**Flow** (`src/datacenter_docs/api/reliability.py` → `auto_remediation.py`):

```
Ticket Created
    ↓
ReliabilityCalculator.calculate_reliability()
    ├─ AI Confidence Score (25%)
    ├─ Human Feedback History (30%)
    ├─ Historical Success Rate (25%)
    └─ Pattern Matching (20%)
    ↓
Overall Reliability Score (0-100%)
    ↓
AutoRemediationDecisionEngine.should_execute()
    ├─ Check if enabled for ticket
    ├─ Check minimum reliability (85%)
    ├─ Check action risk level
    ├─ Check rate limits
    └─ Determine if approval needed
    ↓
AutoRemediationEngine.execute_remediation()
    ├─ Pre-execution checks
    ├─ Execute via MCP Client
    ├─ Post-execution validation
    └─ Log everything
```

**Key Classes**:

- `ReliabilityCalculator`: Calculates the weighted reliability score
- `AutoRemediationDecisionEngine`: Decides if/how to execute
- `AutoRemediationEngine`: Actually executes actions via MCP

### 4. **MCP Client Integration**

**Location**: `src/datacenter_docs/mcp/client.py`

MCP (Model Context Protocol) is the bridge to infrastructure. It's an external service that connects to VMware, Kubernetes, network devices, etc.
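The read/write split can be illustrated with a minimal sketch. Everything below is hypothetical: the class name, method names, operation strings, and endpoint URL are illustrative assumptions, not the real API — check `src/datacenter_docs/mcp/client.py` for the actual interface:

```python
import asyncio
from typing import Any


class MCPClientSketch:
    """Hypothetical shape of the external MCP API surface; NOT the real client."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    async def _call(
        self, operation: str, params: dict[str, Any], *, dry_run: bool = False
    ) -> dict[str, Any]:
        # The real client would call the external MCP service over the network here;
        # this sketch just echoes the request so the call shape is visible.
        return {"operation": operation, "params": params, "dry_run": dry_run, "success": True}

    # Read operation: no side effects on infrastructure
    async def get_vm_status(self, vm_name: str) -> dict[str, Any]:
        return await self._call("vm.status", {"name": vm_name})

    # Write operation (auto-remediation): rehearse with dry_run before executing
    async def restart_vm(self, vm_name: str, *, dry_run: bool = False) -> dict[str, Any]:
        return await self._call("vm.restart", {"name": vm_name}, dry_run=dry_run)


async def main() -> None:
    mcp = MCPClientSketch("http://mcp.internal:9000")  # hypothetical endpoint
    print(await mcp.get_vm_status("web-01"))
    print(await mcp.restart_vm("web-01", dry_run=True))


if __name__ == "__main__":
    asyncio.run(main())
```

The `dry_run` flag mirrors the safety guidance later in this file: test write operations with `dry_run=True` before executing them for real.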
**Important**: The MCP Client is EXTERNAL. We don't implement the infrastructure connections - we call MCP's API.

**Operations**:

- Read operations: Get VM status, list pods, check network config
- Write operations (auto-remediation): Restart VM, scale deployment, enable port

### 5. **Documentation Agent (Agentic AI)**

**Location**: `src/datacenter_docs/chat/agent.py`

**Architecture Pattern**: RAG (Retrieval Augmented Generation)

```
User Query
    ↓
Vector Search (ChromaDB + HuggingFace embeddings)
    ↓
Retrieve Top-K Relevant Docs
    ↓
Build Context + Query → LLM
    ↓
Generate Response with Citations
```

**Key Methods**:

- `search_documentation()`: Semantic search in the vector store
- `resolve_ticket()`: Analyze a problem and suggest a resolution
- `chat_with_context()`: Conversational interface with doc search

### 6. **Missing Critical Components** (TODO)

**See `TODO.md` for the comprehensive list**. When implementing new features, check TODO.md first.

**High Priority Missing Components**:

1. **CLI Tool** (`src/datacenter_docs/cli.py`):
   - Entry point: `datacenter-docs` command
   - Uses Typer + Rich for the CLI
   - Commands: generate, serve, worker, init-db, stats

2. **Celery Workers** (`src/datacenter_docs/workers/`):
   - `celery_app.py`: Celery configuration
   - `tasks.py`: Async tasks (documentation generation, auto-remediation execution)
   - Background task processing

3. **Collectors** (`src/datacenter_docs/collectors/`):
   - Base class exists, implementations missing
   - Need: VMware, Kubernetes, Network, Storage collectors
   - Pattern: `async def collect() -> dict`

4. **Generators** (`src/datacenter_docs/generators/`):
   - Base class exists, implementations missing
   - Need: Infrastructure, Network, Virtualization generators
   - Pattern: `async def generate(data: dict) -> str` (returns Markdown)

**When implementing these**:

- Follow existing patterns in the base classes
- Use `LLMClient` for AI generation
- Use `MCPClient` for infrastructure data collection
- All operations are async
- Use MongoDB/Beanie for storage

---

## Code Patterns & Conventions

### Async/Await

All operations use asyncio:

```python
async def my_function():
    result = await some_async_call()
```

### Type Hints

Type hints are required (mypy is configured strictly):

```python
async def process_ticket(ticket_id: str) -> Dict[str, Any]:
    ...
```

### Logging

Use structured logging with a module-level logger:

```python
import logging

logger = logging.getLogger(__name__)

logger.info(f"Processing ticket {ticket_id}")
logger.error(f"Failed to execute action: {e}", exc_info=True)
```

### Configuration

All config goes through `src/datacenter_docs/utils/config.py` using Pydantic Settings:

```python
from datacenter_docs.utils.config import get_settings

settings = get_settings()
mongodb_url = settings.MONGODB_URL
llm_model = settings.LLM_MODEL
```

### Error Handling

```python
try:
    result = await risky_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}", exc_info=True)
    return {"success": False, "error": str(e)}
```

---

## Docker Development Workflow

**Primary development environment**: Docker Compose

**Services in `deploy/docker/docker-compose.dev.yml`**:

- `mongodb`: MongoDB 7 (port 27017)
- `redis`: Redis 7 (port 6379)
- `api`: FastAPI service (port 8000)
- `chat`: WebSocket chat server (port 8001) - **NOT IMPLEMENTED**
- `worker`: Celery worker - **NOT IMPLEMENTED**
- `frontend`: React + Nginx (port 80) - **MINIMAL**

**Development cycle**:

1. Edit code in `src/`
2. Rebuild and restart the affected service: `docker-compose -f docker-compose.dev.yml up --build -d api`
3. Check logs: `docker-compose -f docker-compose.dev.yml logs -f api`
4. Test: access http://localhost:8000/api/docs

**Volume mounts**: Source code is mounted, so changes are reflected immediately (dependency changes still require a rebuild).

---

## CI/CD Pipelines

**Three CI/CD systems are configured** (all use Python 3.12):

- `.github/workflows/build-deploy.yml`: GitHub Actions
- `.gitlab-ci.yml`: GitLab CI
- `.gitea/workflows/ci.yml`: Gitea Actions

**Pipeline stages**:

1. Lint (Black, Ruff)
2. Type check (mypy)
3. Test (pytest)
4. Build Docker image
5. Deploy (if on main branch)

**When modifying the Python version**: Update ALL three pipeline files.

---

## Key Files Reference

**Core Application**:

- `src/datacenter_docs/api/main.py`: FastAPI application entry point
- `src/datacenter_docs/api/models.py`: MongoDB/Beanie models (all data structures)
- `src/datacenter_docs/utils/config.py`: Configuration management
- `src/datacenter_docs/utils/llm_client.py`: LLM provider abstraction

**Auto-Remediation**:

- `src/datacenter_docs/api/reliability.py`: Reliability scoring and decision engine
- `src/datacenter_docs/api/auto_remediation.py`: Execution engine with safety checks

**Infrastructure Integration**:

- `src/datacenter_docs/mcp/client.py`: MCP protocol client
- `src/datacenter_docs/chat/agent.py`: Documentation AI agent (RAG)

**Configuration**:

- `.env.example`: Template with ALL config options (including LLM provider examples)
- `pyproject.toml`: Dependencies, scripts, linting config (Black 100 char, Python 3.12)

**Documentation**:

- `README.md`: User-facing documentation
- `TODO.md`: **CRITICAL** - current project status, missing components, roadmap
- `deploy/docker/README.md`: Docker environment guide

---

## Important Notes

### Python Version

Use Python 3.12 (standardized across the project).
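### Running CI Checks Locally

The CI pipeline stages can be reproduced locally before pushing, using the project's own commands. A sketch (assumes `poetry install` has been run):

```bash
# Mirror the pipeline stages in order
poetry run black --check src/ tests/   # lint: format check only, no rewrites
poetry run ruff check src/             # lint
poetry run mypy src/                   # type check
poetry run pytest                      # tests
```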
### Database Queries

MongoDB queries look different from SQL:

```python
# Find
tickets = await Ticket.find(Ticket.status == TicketStatus.PENDING).to_list()

# Find one
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")

# Update
ticket.status = TicketStatus.RESOLVED
await ticket.save()

# Complex query
tickets = await Ticket.find(
    Ticket.created_at > datetime.now() - timedelta(days=7),
    Ticket.category == "network",
).to_list()
```

### LLM API Calls

Use the generic client:

```python
from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
```

### Auto-Remediation Safety

When implementing new remediation actions:

1. Define the action in the `RemediationAction` model
2. Set an appropriate `ActionRiskLevel` (low/medium/high/critical)
3. Implement pre/post validation checks
4. Add comprehensive logging
5. Test with `dry_run=True` first

### Testing

Tests are currently minimal. When adding tests:

- Use `pytest-asyncio` for async tests
- Mock the MCP client and LLM client
- Test reliability calculations thoroughly
- Test safety checks in auto-remediation

---

## When Implementing New Features

1. Check `TODO.md` first - the component might be partially implemented
2. Follow existing patterns in similar components
3. Use type hints (mypy is strict)
4. Use `LLMClient` for AI operations
5. Use the Beanie ODM for database operations
6. All operations are async (use async/await)
7. Test in Docker (the primary development environment)
8. Update `TODO.md` when marking components as completed

---

## Questions? Check These Files

- **"How do I configure the LLM provider?"** → `.env.example`, `utils/config.py`, `utils/llm_client.py`
- **"How does auto-remediation work?"** → `api/reliability.py`, `api/auto_remediation.py`
- **"What's not implemented yet?"** → `TODO.md` (comprehensive list with estimates)
- **"How do I run tests/lint?"** → `pyproject.toml` (all commands), this file
- **"Database schema?"** → `api/models.py` (all Beanie models)
- **"Docker services?"** → `deploy/docker/docker-compose.dev.yml`, `deploy/docker/README.md`
- **"API endpoints?"** → `api/main.py`, or http://localhost:8000/api/docs when running

---

**Last Updated**: 2025-10-19
**Project Status**: 35% complete (infrastructure done, business logic pending)
**Next Priority**: CLI tool → Celery workers → Collectors → Generators