llm-automation-docs-and-rem…/CLAUDE.md

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

---

## Project Overview

**LLM Automation - Docs & Remediation Engine**: AI-powered datacenter documentation generation with autonomous problem resolution capabilities. The system uses LLMs to automatically generate infrastructure documentation and can autonomously execute remediation actions on datacenter infrastructure.

**Current Status**: ~35% complete - Infrastructure and API are functional, but CLI tool, Celery workers, collectors, and generators are not yet implemented.

**Language**: Python 3.12 (standardized across entire project)

**Database**: MongoDB with Beanie ODM (async, document-based)

---

## Essential Commands

### Development Environment Setup

**NOTE for Fedora Users**: Replace `docker-compose` with `podman-compose` in all commands below. Podman is the default container engine on Fedora and is Docker-compatible.

```bash
# Install dependencies
poetry install

# Start Docker development stack (6 services: MongoDB, Redis, API, Chat, Worker, Frontend)
# On Fedora: use 'podman-compose' instead of 'docker-compose'
cd deploy/docker
docker-compose -f docker-compose.dev.yml up --build -d

# Check service status
docker-compose -f docker-compose.dev.yml ps

# View logs
docker-compose -f docker-compose.dev.yml logs -f api
docker-compose -f docker-compose.dev.yml logs -f --tail=50 api

# Stop services
docker-compose -f docker-compose.dev.yml down

# Restart single service after code changes
docker-compose -f docker-compose.dev.yml restart api
```

### Testing & Code Quality

```bash
# Run all tests
poetry run pytest

# Run specific test file
poetry run pytest tests/test_reliability.py

# Run with coverage
poetry run pytest --cov=src/datacenter_docs --cov-report=html

# Linting
poetry run black src/
poetry run ruff check src/
poetry run mypy src/

# Format code (100 char line length)
poetry run black src/ tests/
```

### Running Services Locally

```bash
# API server (development with auto-reload)
poetry run uvicorn datacenter_docs.api.main:app --reload --host 0.0.0.0 --port 8000

# CLI tool (NOT YET IMPLEMENTED - needs src/datacenter_docs/cli.py)
poetry run datacenter-docs --help

# Celery worker (NOT YET IMPLEMENTED - needs src/datacenter_docs/workers/)
poetry run docs-worker

# Chat server (NOT YET IMPLEMENTED - needs src/datacenter_docs/chat/main.py)
poetry run docs-chat
```

### Database Operations

```bash
# Access MongoDB shell in Docker (use 'podman' instead of 'docker' on Fedora)
docker exec -it datacenter-docs-mongodb-dev mongosh -u admin -p admin123

# Access Redis CLI (use 'podman' instead of 'docker' on Fedora)
docker exec -it datacenter-docs-redis-dev redis-cli

# Check database connectivity
curl http://localhost:8000/health
```

---

## High-Level Architecture

### 1. **LLM Provider System (OpenAI-Compatible API)**

**Location**: `src/datacenter_docs/utils/llm_client.py`

**Key Concept**: All LLM interactions go through `LLMClient` which uses the OpenAI SDK and can connect to ANY OpenAI-compatible provider:
- OpenAI (GPT-4, GPT-3.5)
- Anthropic Claude (via OpenAI-compatible endpoint)
- LLMStudio (local models)
- Open-WebUI (local models)
- Ollama (local models)

**Configuration** (in `.env`):
```bash
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key
LLM_MODEL=gpt-4-turbo-preview
```

**Usage**:
```python
from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
json_response = await llm.generate_json(messages=[...])
```

### 2. **Database Architecture (MongoDB + Beanie ODM)**

**Location**: `src/datacenter_docs/api/models.py`

**Key Characteristics**:
- Models inherit from `beanie.Document`
- MongoDB atomic operations
- Async operations: `await Ticket.find_one()`, `await ticket.save()`
- ObjectId for primary keys: `PydanticObjectId`
- Supports embedded documents and references

**Example**:
```python
from beanie import Document, PydanticObjectId
from datetime import datetime

class Ticket(Document):
    ticket_id: str
    status: TicketStatus
    created_at: datetime = datetime.now()

    class Settings:
        name = "tickets"  # Collection name
        indexes = ["ticket_id", "status"]

# Usage
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")
ticket.status = TicketStatus.RESOLVED
await ticket.save()
```

### 3. **Auto-Remediation Decision Flow**

**Multi-layered safety system** that decides whether AI can execute infrastructure changes.

**Flow** (`src/datacenter_docs/api/reliability.py` → `auto_remediation.py`):

```
Ticket Created
    ↓
ReliabilityCalculator.calculate_reliability()
    ├─ AI Confidence Score (25%)
    ├─ Human Feedback History (30%)
    ├─ Historical Success Rate (25%)
    └─ Pattern Matching (20%)
    ↓
Overall Reliability Score (0-100%)
    ↓
AutoRemediationDecisionEngine.should_execute()
    ├─ Check if enabled for ticket
    ├─ Check minimum reliability (85%)
    ├─ Check action risk level
    ├─ Check rate limits
    └─ Determine if approval needed
    ↓
AutoRemediationEngine.execute_remediation()
    ├─ Pre-execution checks
    ├─ Execute via MCP Client
    ├─ Post-execution validation
    └─ Log everything
```

**Key Classes**:
- `ReliabilityCalculator`: Calculates weighted reliability score
- `AutoRemediationDecisionEngine`: Decides if/how to execute
- `AutoRemediationEngine`: Actually executes actions via MCP

### 4. **MCP Client Integration**

**Location**: `src/datacenter_docs/mcp/client.py`

MCP (Model Context Protocol) is the bridge to infrastructure. It's an external service that connects to VMware, Kubernetes, network devices, etc.

**Important**: MCP Client is EXTERNAL. We don't implement the infrastructure connections - we call MCP's API.

**Operations**:
- Read operations: Get VM status, list pods, check network config
- Write operations (auto-remediation): Restart VM, scale deployment, enable port

### 5. **Documentation Agent (Agentic AI)**

**Location**: `src/datacenter_docs/chat/agent.py`

**Architecture Pattern**: RAG (Retrieval Augmented Generation)

```
User Query
    ↓
Vector Search (ChromaDB + HuggingFace embeddings)
    ↓
Retrieve Top-K Relevant Docs
    ↓
Build Context + Query → LLM
    ↓
Generate Response with Citations
```

**Key Methods**:
- `search_documentation()`: Semantic search in vector store
- `resolve_ticket()`: Analyze problem + suggest resolution
- `chat_with_context()`: Conversational interface with doc search

### 6. **Missing Critical Components** (TODO)

**See `TODO.md` for comprehensive list**. When implementing new features, check TODO.md first.

**High Priority Missing Components**:

1. **CLI Tool** (`src/datacenter_docs/cli.py`):
   - Entry point: `datacenter-docs` command
   - Uses Typer + Rich for CLI
   - Commands: generate, serve, worker, init-db, stats

2. **Celery Workers** (`src/datacenter_docs/workers/`):
   - `celery_app.py`: Celery configuration
   - `tasks.py`: Async tasks (documentation generation, auto-remediation execution)
   - Background task processing

3. **Collectors** (`src/datacenter_docs/collectors/`):
   - Base class exists, implementations missing
   - Need: VMware, Kubernetes, Network, Storage collectors
   - Pattern: `async def collect() -> dict`

4. **Generators** (`src/datacenter_docs/generators/`):
   - Base class exists, implementations missing
   - Need: Infrastructure, Network, Virtualization generators
   - Pattern: `async def generate(data: dict) -> str` (returns Markdown)

**When implementing these**:
- Follow existing patterns in base classes
- Use `LLMClient` for AI generation
- Use `MCPClient` for infrastructure data collection
- All operations are async
- Use MongoDB/Beanie for storage

---

## Code Patterns & Conventions

### Async/Await

All operations use asyncio:

```python
async def my_function():
    result = await some_async_call()
```

### Type Hints

Type hints are required (mypy configured strictly):

```python
async def process_ticket(ticket_id: str) -> Dict[str, Any]:
    ...
```

### Logging

Use structured logging with module-level logger:

```python
import logging

logger = logging.getLogger(__name__)

logger.info(f"Processing ticket {ticket_id}")
logger.error(f"Failed to execute action: {e}", exc_info=True)
```

### Configuration

All config via `src/datacenter_docs/utils/config.py` using Pydantic Settings:

```python
from datacenter_docs.utils.config import get_settings

settings = get_settings()
mongodb_url = settings.MONGODB_URL
llm_model = settings.LLM_MODEL
```

### Error Handling

```python
try:
    result = await risky_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}", exc_info=True)
    return {"success": False, "error": str(e)}
```

---

## Docker Development Workflow

**Primary development environment**: Docker Compose

**Fedora Users**: Use `podman-compose` instead of `docker-compose` and `podman` instead of `docker` for all commands. Podman is the default container engine on Fedora and is Docker-compatible.

**Services in `deploy/docker/docker-compose.dev.yml`**:
- `mongodb`: MongoDB 7 (port 27017)
- `redis`: Redis 7 (port 6379)
- `api`: FastAPI service (port 8000)
- `chat`: WebSocket chat server (port 8001) - **NOT IMPLEMENTED**
- `worker`: Celery worker - **NOT IMPLEMENTED**
- `frontend`: React + Nginx (port 80) - **MINIMAL**

**Development cycle**:
1. Edit code in `src/`
2. Rebuild and restart affected service: `docker-compose -f docker-compose.dev.yml up --build -d api` (use `podman-compose` on Fedora)
3. Check logs: `docker-compose -f docker-compose.dev.yml logs -f api` (use `podman-compose` on Fedora)
4. Test: Access http://localhost:8000/api/docs

**Volume mounts**: Source code is mounted, so changes are reflected (except for dependency changes which need rebuild).

---

## CI/CD Pipelines

**Three CI/CD systems configured** (all use Python 3.12):
- `.github/workflows/build-deploy.yml`: GitHub Actions
- `.gitlab-ci.yml`: GitLab CI
- `.gitea/workflows/ci.yml`: Gitea Actions

**Pipeline stages**:
1. Lint (Black, Ruff)
2. Type check (mypy)
3. Test (pytest)
4. Build Docker image
5. Deploy (if on main branch)

**When modifying Python version**: Update ALL three pipeline files.

---

## Key Files Reference

**Core Application**:
- `src/datacenter_docs/api/main.py`: FastAPI application entry point
- `src/datacenter_docs/api/models.py`: MongoDB/Beanie models (all data structures)
- `src/datacenter_docs/utils/config.py`: Configuration management
- `src/datacenter_docs/utils/llm_client.py`: LLM provider abstraction

**Auto-Remediation**:
- `src/datacenter_docs/api/reliability.py`: Reliability scoring and decision engine
- `src/datacenter_docs/api/auto_remediation.py`: Execution engine with safety checks

**Infrastructure Integration**:
- `src/datacenter_docs/mcp/client.py`: MCP protocol client
- `src/datacenter_docs/chat/agent.py`: Documentation AI agent (RAG)

**Configuration**:
- `.env.example`: Template with ALL config options (including LLM provider examples)
- `pyproject.toml`: Dependencies, scripts, linting config (Black 100 char, Python 3.12)

**Documentation**:
- `README.md`: User-facing documentation
- `TODO.md`: **CRITICAL** - Current project status, missing components, roadmap
- `deploy/docker/README.md`: Docker environment guide

---

## Important Notes

### Python Version
Use Python 3.12 (standardized across the project).

### Database Queries
MongoDB queries look different from SQL:
```python
# Find
tickets = await Ticket.find(Ticket.status == TicketStatus.PENDING).to_list()

# Find one
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")

# Update
ticket.status = TicketStatus.RESOLVED
await ticket.save()

# Complex query
tickets = await Ticket.find(
    Ticket.created_at > datetime.now() - timedelta(days=7),
    Ticket.category == "network"
).to_list()
```

### LLM API Calls
Use the generic client:
```python
from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
```

### Auto-Remediation Safety
When implementing new remediation actions:
1. Define action in `RemediationAction` model
2. Set appropriate `ActionRiskLevel` (low/medium/high/critical)
3. Implement pre/post validation checks
4. Add comprehensive logging
5. Test with `dry_run=True` first

### Testing
Tests are minimal currently. When adding tests:
- Use `pytest-asyncio` for async tests
- Mock MCP client and LLM client
- Test reliability calculations thoroughly
- Test safety checks in auto-remediation

---

## When Implementing New Features

1. Check `TODO.md` first - component might be partially implemented
2. Follow existing patterns in similar components
3. Use type hints (mypy is strict)
4. Use `LLMClient` for AI operations
5. Use Beanie ORM for database operations
6. All operations are async (use async/await)
7. Test in Docker (primary development environment)
8. Update `TODO.md` when marking components as completed

---

## Questions? Check These Files

- **"How do I configure the LLM provider?"** → `.env.example`, `utils/config.py`, `utils/llm_client.py`
- **"How does auto-remediation work?"** → `api/reliability.py`, `api/auto_remediation.py`
- **"What's not implemented yet?"** → `TODO.md` (comprehensive list with estimates)
- **"How do I run tests/lint?"** → `pyproject.toml` (all commands), this file
- **"Database schema?"** → `api/models.py` (all Beanie models)
- **"Docker services?"** → `deploy/docker/docker-compose.dev.yml`, `deploy/docker/README.md`
- **"API endpoints?"** → `api/main.py`, or http://localhost:8000/api/docs when running

---

**Last Updated**: 2025-10-19
**Project Status**: 35% complete (Infrastructure done, business logic pending)
**Next Priority**: CLI tool → Celery workers → Collectors → Generators