# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview

**LLM Automation - Docs & Remediation Engine**: AI-powered datacenter documentation generation with autonomous problem resolution capabilities. The system uses LLMs to automatically generate infrastructure documentation and can autonomously execute remediation actions on datacenter infrastructure.

**Current Status**: ~35% complete - infrastructure and API are functional, but the CLI tool, Celery workers, collectors, and generators are not yet implemented.

**Language**: Python 3.12 (standardized across the entire project)

**Database**: MongoDB with Beanie ODM (async, document-based)
## Essential Commands

### Development Environment Setup

```bash
# Install dependencies
poetry install

# Start the Docker development stack (6 services: MongoDB, Redis, API, Chat, Worker, Frontend)
cd deploy/docker
docker-compose -f docker-compose.dev.yml up --build -d

# Check service status
docker-compose -f docker-compose.dev.yml ps

# View logs
docker-compose -f docker-compose.dev.yml logs -f api
docker-compose -f docker-compose.dev.yml logs -f --tail=50 api

# Stop services
docker-compose -f docker-compose.dev.yml down

# Restart a single service after code changes
docker-compose -f docker-compose.dev.yml restart api
```
### Testing & Code Quality

```bash
# Run all tests
poetry run pytest

# Run a specific test file
poetry run pytest tests/test_reliability.py

# Run with coverage
poetry run pytest --cov=src/datacenter_docs --cov-report=html

# Linting
poetry run black src/
poetry run ruff check src/
poetry run mypy src/

# Format code (100-char line length)
poetry run black src/ tests/
```
### Running Services Locally

```bash
# API server (development, with auto-reload)
poetry run uvicorn datacenter_docs.api.main:app --reload --host 0.0.0.0 --port 8000

# CLI tool (NOT YET IMPLEMENTED - needs src/datacenter_docs/cli.py)
poetry run datacenter-docs --help

# Celery worker (NOT YET IMPLEMENTED - needs src/datacenter_docs/workers/)
poetry run docs-worker

# Chat server (NOT YET IMPLEMENTED - needs src/datacenter_docs/chat/main.py)
poetry run docs-chat
```
### Database Operations

```bash
# Access the MongoDB shell in Docker
docker exec -it datacenter-docs-mongodb-dev mongosh -u admin -p admin123

# Access the Redis CLI
docker exec -it datacenter-docs-redis-dev redis-cli

# Check database connectivity
curl http://localhost:8000/health
```
## High-Level Architecture

### 1. LLM Provider System (OpenAI-Compatible API)

**Location**: `src/datacenter_docs/utils/llm_client.py`

**Key Concept**: All LLM interactions go through `LLMClient`, which uses the OpenAI SDK and can connect to ANY OpenAI-compatible provider:

- OpenAI (GPT-4, GPT-3.5)
- Anthropic Claude (via an OpenAI-compatible endpoint)
- LLMStudio (local models)
- Open-WebUI (local models)
- Ollama (local models)

**Configuration** (in `.env`):

```bash
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key
LLM_MODEL=gpt-4-turbo-preview
```

**Usage**:

```python
from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
json_response = await llm.generate_json(messages=[...])
```
### 2. Database Architecture (MongoDB + Beanie ODM)

**Location**: `src/datacenter_docs/api/models.py`

**Key Characteristics**:

- Models inherit from `beanie.Document` - MongoDB atomic operations
- Async operations: `await Ticket.find_one()`, `await ticket.save()`
- ObjectId for primary keys: `PydanticObjectId`
- Supports embedded documents and references

**Example**:

```python
from datetime import datetime

from beanie import Document, PydanticObjectId
from pydantic import Field


class Ticket(Document):
    ticket_id: str
    status: TicketStatus
    # default_factory evaluates per instance, not once at class definition
    created_at: datetime = Field(default_factory=datetime.now)

    class Settings:
        name = "tickets"  # Collection name
        indexes = ["ticket_id", "status"]


# Usage
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")
ticket.status = TicketStatus.RESOLVED
await ticket.save()
```
### 3. Auto-Remediation Decision Flow

A multi-layered safety system decides whether the AI can execute infrastructure changes.

**Flow** (`src/datacenter_docs/api/reliability.py` → `auto_remediation.py`):

```text
Ticket Created
    ↓
ReliabilityCalculator.calculate_reliability()
    ├─ AI Confidence Score (25%)
    ├─ Human Feedback History (30%)
    ├─ Historical Success Rate (25%)
    └─ Pattern Matching (20%)
    ↓
Overall Reliability Score (0-100%)
    ↓
AutoRemediationDecisionEngine.should_execute()
    ├─ Check if enabled for ticket
    ├─ Check minimum reliability (85%)
    ├─ Check action risk level
    ├─ Check rate limits
    └─ Determine if approval needed
    ↓
AutoRemediationEngine.execute_remediation()
    ├─ Pre-execution checks
    ├─ Execute via MCP Client
    ├─ Post-execution validation
    └─ Log everything
```
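The weighted scoring in the flow above can be sketched as follows. The weights (25/30/25/20) and the 85% threshold come from the diagram; the function shapes and score names are simplified assumptions, not the real `ReliabilityCalculator` API:

```python
# Sketch of the weighted reliability score; weights and threshold follow
# the flow diagram, but names are illustrative assumptions.
WEIGHTS = {
    "ai_confidence": 0.25,
    "human_feedback": 0.30,
    "historical_success": 0.25,
    "pattern_matching": 0.20,
}

MIN_RELIABILITY = 0.85  # minimum score for autonomous execution


def calculate_reliability(scores: dict[str, float]) -> float:
    """Combine component scores (each 0.0-1.0) into an overall score."""
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


def should_execute(scores: dict[str, float]) -> bool:
    """True only when the weighted score clears the 85% threshold."""
    return calculate_reliability(scores) >= MIN_RELIABILITY
```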
**Key Classes**:

- `ReliabilityCalculator`: Calculates the weighted reliability score
- `AutoRemediationDecisionEngine`: Decides if/how to execute
- `AutoRemediationEngine`: Actually executes actions via MCP
### 4. MCP Client Integration

**Location**: `src/datacenter_docs/mcp/client.py`

MCP (Model Context Protocol) is the bridge to infrastructure. It is an external service that connects to VMware, Kubernetes, network devices, etc.

**Important**: The MCP Client is EXTERNAL. We don't implement the infrastructure connections - we call MCP's API.

**Operations**:

- Read operations: get VM status, list pods, check network config
- Write operations (auto-remediation): restart VM, scale deployment, enable port
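The read/write split above can be illustrated with a stand-in client. The real API lives in `src/datacenter_docs/mcp/client.py`; every method name below is an illustrative assumption, not the actual MCP interface:

```python
# Illustrative stub only: method names and payloads are assumptions used
# to show the read/write split, not the real MCPClient API.
import asyncio


class FakeMCPClient:
    """Stand-in for the external MCP service."""

    async def get_vm_status(self, vm_name: str) -> dict:
        # Read operation: no infrastructure change
        return {"vm": vm_name, "power_state": "poweredOn"}

    async def restart_vm(self, vm_name: str, dry_run: bool = True) -> dict:
        # Write operation: gated behind dry_run by default
        return {"vm": vm_name, "action": "restart", "executed": not dry_run}


async def main() -> dict:
    client = FakeMCPClient()
    status = await client.get_vm_status("web-01")
    result = await client.restart_vm("web-01", dry_run=True)
    return {"status": status, "result": result}


outcome = asyncio.run(main())
```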
### 5. Documentation Agent (Agentic AI)

**Location**: `src/datacenter_docs/chat/agent.py`

**Architecture Pattern**: RAG (Retrieval-Augmented Generation)

```text
User Query
    ↓
Vector Search (ChromaDB + HuggingFace embeddings)
    ↓
Retrieve Top-K Relevant Docs
    ↓
Build Context + Query → LLM
    ↓
Generate Response with Citations
```
**Key Methods**:

- `search_documentation()`: Semantic search in the vector store
- `resolve_ticket()`: Analyze a problem and suggest a resolution
- `chat_with_context()`: Conversational interface with doc search
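The retrieve-then-prompt shape of the RAG flow can be sketched in pure Python. The real agent uses ChromaDB with HuggingFace embeddings; the toy vectors and document names below are made up for illustration:

```python
# Toy top-k retrieval sketch; the real agent uses ChromaDB + HuggingFace
# embeddings. Vectors and doc names here are illustrative only.
import math

DOCS = [
    ("vm-restart.md", [0.9, 0.1, 0.0]),
    ("network-vlans.md", [0.1, 0.9, 0.0]),
    ("storage-quotas.md", [0.0, 0.1, 0.9]),
]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search_documentation(query_vec: list[float], k: int = 2) -> list[str]:
    """Top-k semantic search: rank docs by cosine similarity to the query."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Retrieved docs become LLM context; citations come from the doc names."""
    context = "\n".join(f"[source: {d}]" for d in context_docs)
    return f"{context}\n\nQuestion: {query}"
```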
### 6. Missing Critical Components (TODO)

See `TODO.md` for the comprehensive list. When implementing new features, check `TODO.md` first.

**High Priority Missing Components**:

1. **CLI Tool** (`src/datacenter_docs/cli.py`):
   - Entry point: `datacenter-docs` command
   - Uses Typer + Rich for the CLI
   - Commands: generate, serve, worker, init-db, stats

2. **Celery Workers** (`src/datacenter_docs/workers/`):
   - `celery_app.py`: Celery configuration
   - `tasks.py`: Async tasks (documentation generation, auto-remediation execution)
   - Background task processing

3. **Collectors** (`src/datacenter_docs/collectors/`):
   - Base class exists, implementations missing
   - Need: VMware, Kubernetes, Network, Storage collectors
   - Pattern: `async def collect() -> dict`

4. **Generators** (`src/datacenter_docs/generators/`):
   - Base class exists, implementations missing
   - Need: Infrastructure, Network, Virtualization generators
   - Pattern: `async def generate(data: dict) -> str` (returns Markdown)
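The `collect() -> dict` and `generate(data) -> str` patterns above can be sketched as a pair of stubs. Class names and the sample data are illustrative assumptions; real implementations should follow the base classes under `collectors/` and `generators/` and pull live data via `MCPClient`:

```python
# Sketch of the collector/generator patterns; class names and data are
# illustrative assumptions, not the project's base classes.
import asyncio


class StubVMwareCollector:
    async def collect(self) -> dict:
        # A real collector would call MCPClient for live infrastructure data
        return {"vms": [{"name": "web-01", "cpu": 4}, {"name": "db-01", "cpu": 8}]}


class StubInfrastructureGenerator:
    async def generate(self, data: dict) -> str:
        # A real generator would hand the data to LLMClient; returns Markdown
        lines = ["# Infrastructure", ""]
        lines += [f"- {vm['name']}: {vm['cpu']} vCPU" for vm in data["vms"]]
        return "\n".join(lines)


async def pipeline() -> str:
    data = await StubVMwareCollector().collect()
    return await StubInfrastructureGenerator().generate(data)


markdown = asyncio.run(pipeline())
```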
**When implementing these**:

- Follow existing patterns in the base classes
- Use `LLMClient` for AI generation
- Use `MCPClient` for infrastructure data collection
- All operations are async
- Use MongoDB/Beanie for storage
## Code Patterns & Conventions

### Async/Await

All operations use asyncio:

```python
async def my_function():
    result = await some_async_call()
```

### Type Hints

Type hints are required (mypy is configured strictly):

```python
async def process_ticket(ticket_id: str) -> Dict[str, Any]:
    ...
```

### Logging

Use structured logging with a module-level logger:

```python
import logging

logger = logging.getLogger(__name__)

logger.info(f"Processing ticket {ticket_id}")
logger.error(f"Failed to execute action: {e}", exc_info=True)
```

### Configuration

All config goes through `src/datacenter_docs/utils/config.py` using Pydantic Settings:

```python
from datacenter_docs.utils.config import get_settings

settings = get_settings()
mongodb_url = settings.MONGODB_URL
llm_model = settings.LLM_MODEL
```

### Error Handling

```python
try:
    result = await risky_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}", exc_info=True)
    return {"success": False, "error": str(e)}
```
## Docker Development Workflow

**Primary development environment**: Docker Compose

**Services** in `deploy/docker/docker-compose.dev.yml`:

- `mongodb`: MongoDB 7 (port 27017)
- `redis`: Redis 7 (port 6379)
- `api`: FastAPI service (port 8000)
- `chat`: WebSocket chat server (port 8001) - NOT IMPLEMENTED
- `worker`: Celery worker - NOT IMPLEMENTED
- `frontend`: React + Nginx (port 80) - MINIMAL

**Development cycle**:

1. Edit code in `src/`
2. Rebuild and restart the affected service: `docker-compose -f docker-compose.dev.yml up --build -d api`
3. Check logs: `docker-compose -f docker-compose.dev.yml logs -f api`
4. Test: access http://localhost:8000/api/docs

**Volume mounts**: Source code is mounted, so changes are reflected immediately (except dependency changes, which require a rebuild).
## CI/CD Pipelines

Three CI/CD systems are configured (all use Python 3.12):

- `.github/workflows/build-deploy.yml`: GitHub Actions
- `.gitlab-ci.yml`: GitLab CI
- `.gitea/workflows/ci.yml`: Gitea Actions

**Pipeline stages**:

1. Lint (Black, Ruff)
2. Type check (mypy)
3. Test (pytest)
4. Build Docker image
5. Deploy (if on main branch)

**When modifying the Python version**: update ALL three pipeline files.
## Key Files Reference

**Core Application**:

- `src/datacenter_docs/api/main.py`: FastAPI application entry point
- `src/datacenter_docs/api/models.py`: MongoDB/Beanie models (all data structures)
- `src/datacenter_docs/utils/config.py`: Configuration management
- `src/datacenter_docs/utils/llm_client.py`: LLM provider abstraction

**Auto-Remediation**:

- `src/datacenter_docs/api/reliability.py`: Reliability scoring and decision engine
- `src/datacenter_docs/api/auto_remediation.py`: Execution engine with safety checks

**Infrastructure Integration**:

- `src/datacenter_docs/mcp/client.py`: MCP protocol client
- `src/datacenter_docs/chat/agent.py`: Documentation AI agent (RAG)

**Configuration**:

- `.env.example`: Template with ALL config options (including LLM provider examples)
- `pyproject.toml`: Dependencies, scripts, linting config (Black 100 chars, Python 3.12)

**Documentation**:

- `README.md`: User-facing documentation
- `TODO.md`: CRITICAL - current project status, missing components, roadmap
- `deploy/docker/README.md`: Docker environment guide
## Important Notes

### Python Version

Use Python 3.12 (standardized across the project).

### Database Queries

MongoDB queries look different from SQL:

```python
# Find
tickets = await Ticket.find(Ticket.status == TicketStatus.PENDING).to_list()

# Find one
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")

# Update
ticket.status = TicketStatus.RESOLVED
await ticket.save()

# Complex query
tickets = await Ticket.find(
    Ticket.created_at > datetime.now() - timedelta(days=7),
    Ticket.category == "network",
).to_list()
```
### LLM API Calls

Use the generic client:

```python
from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
```
### Auto-Remediation Safety

When implementing new remediation actions:

1. Define the action in the `RemediationAction` model
2. Set an appropriate `ActionRiskLevel` (low/medium/high/critical)
3. Implement pre/post validation checks
4. Add comprehensive logging
5. Test with `dry_run=True` first
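The checklist above can be sketched as a guard around a new action. The enum values mirror the checklist, but the function shape and return fields are assumptions, not the real `RemediationAction` model:

```python
# Sketch of the safety checklist; enum values follow the doc, but the
# function shape is an illustrative assumption.
import enum
import logging

logger = logging.getLogger(__name__)


class ActionRiskLevel(enum.Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


def execute_action(name: str, risk: ActionRiskLevel, dry_run: bool = True) -> dict:
    """Pre-check risk, execute (or simulate), and log the outcome."""
    if risk in (ActionRiskLevel.HIGH, ActionRiskLevel.CRITICAL) and not dry_run:
        logger.warning("High-risk action %s requires approval", name)
        return {"executed": False, "reason": "approval_required"}
    logger.info("Running %s (dry_run=%s)", name, dry_run)
    # ... pre-execution checks, MCP call, post-execution validation ...
    return {"executed": not dry_run, "dry_run": dry_run}
```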
### Testing

Tests are currently minimal. When adding tests:

- Use `pytest-asyncio` for async tests
- Mock the MCP client and the LLM client
- Test reliability calculations thoroughly
- Test the safety checks in auto-remediation
## When Implementing New Features

1. Check `TODO.md` first - the component might be partially implemented
2. Follow existing patterns in similar components
3. Use type hints (mypy is strict)
4. Use `LLMClient` for AI operations
5. Use the Beanie ODM for database operations
6. All operations are async (use async/await)
7. Test in Docker (the primary development environment)
8. Update `TODO.md` when marking components as completed
## Questions? Check These Files

- "How do I configure the LLM provider?" → `.env.example`, `utils/config.py`, `utils/llm_client.py`
- "How does auto-remediation work?" → `api/reliability.py`, `api/auto_remediation.py`
- "What's not implemented yet?" → `TODO.md` (comprehensive list with estimates)
- "How do I run tests/lint?" → `pyproject.toml` (all commands), this file
- "Database schema?" → `api/models.py` (all Beanie models)
- "Docker services?" → `deploy/docker/docker-compose.dev.yml`, `deploy/docker/README.md`
- "API endpoints?" → `api/main.py`, or http://localhost:8000/api/docs when running
---

**Last Updated**: 2025-10-19

**Project Status**: ~35% complete (infrastructure done, business logic pending)

**Next Priority**: CLI tool → Celery workers → Collectors → Generators