# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

---

## Project Overview

**LLM Automation - Docs & Remediation Engine**: AI-powered datacenter documentation generation with autonomous problem resolution capabilities. The system uses LLMs to automatically generate infrastructure documentation and can autonomously execute remediation actions on datacenter infrastructure.

**Current Status**: ~35% complete - the infrastructure and API are functional, but the CLI tool, Celery workers, collectors, and generators are not yet implemented.

**Language**: Python 3.12 (standardized across the entire project)

**Database**: MongoDB with Beanie ODM (async, document-based)

---
## Essential Commands

### Development Environment Setup

```bash
# Install dependencies
poetry install

# Start the Docker development stack (6 services: MongoDB, Redis, API, Chat, Worker, Frontend)
cd deploy/docker
docker-compose -f docker-compose.dev.yml up --build -d

# Check service status
docker-compose -f docker-compose.dev.yml ps

# View logs
docker-compose -f docker-compose.dev.yml logs -f api
docker-compose -f docker-compose.dev.yml logs -f --tail=50 api

# Stop services
docker-compose -f docker-compose.dev.yml down

# Restart a single service after code changes
docker-compose -f docker-compose.dev.yml restart api
```
### Testing & Code Quality

```bash
# Run all tests
poetry run pytest

# Run a specific test file
poetry run pytest tests/test_reliability.py

# Run with coverage
poetry run pytest --cov=src/datacenter_docs --cov-report=html

# Linting
poetry run black src/
poetry run ruff check src/
poetry run mypy src/

# Format code (100 char line length)
poetry run black src/ tests/
```
### Running Services Locally

```bash
# API server (development with auto-reload)
poetry run uvicorn datacenter_docs.api.main:app --reload --host 0.0.0.0 --port 8000

# CLI tool (NOT YET IMPLEMENTED - needs src/datacenter_docs/cli.py)
poetry run datacenter-docs --help

# Celery worker (NOT YET IMPLEMENTED - needs src/datacenter_docs/workers/)
poetry run docs-worker

# Chat server (NOT YET IMPLEMENTED - needs src/datacenter_docs/chat/main.py)
poetry run docs-chat
```
### Database Operations

```bash
# Access the MongoDB shell in Docker
docker exec -it datacenter-docs-mongodb-dev mongosh -u admin -p admin123

# Access the Redis CLI
docker exec -it datacenter-docs-redis-dev redis-cli

# Check database connectivity
curl http://localhost:8000/health
```

---
## High-Level Architecture

### 1. **LLM Provider System (OpenAI-Compatible API)**

**Location**: `src/datacenter_docs/utils/llm_client.py`

**Key Concept**: All LLM interactions go through `LLMClient`, which uses the OpenAI SDK and can connect to ANY OpenAI-compatible provider:

- OpenAI (GPT-4, GPT-3.5)
- Anthropic Claude (via an OpenAI-compatible endpoint)
- LM Studio (local models)
- Open WebUI (local models)
- Ollama (local models)

**Configuration** (in `.env`):

```bash
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key
LLM_MODEL=gpt-4-turbo-preview
```

**Usage**:

```python
from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
json_response = await llm.generate_json(messages=[...])
```
### 2. **Database Architecture (MongoDB + Beanie ODM)**

**Location**: `src/datacenter_docs/api/models.py`

**Key Characteristics**:

- Models inherit from `beanie.Document`
- MongoDB atomic operations
- Async operations: `await Ticket.find_one()`, `await ticket.save()`
- ObjectId for primary keys: `PydanticObjectId`
- Supports embedded documents and references

**Example**:

```python
from datetime import datetime

from beanie import Document
from pydantic import Field


class Ticket(Document):
    ticket_id: str
    status: TicketStatus  # enum defined alongside the models
    # Use default_factory so the timestamp is evaluated per document,
    # not once at class-definition time
    created_at: datetime = Field(default_factory=datetime.now)

    class Settings:
        name = "tickets"  # Collection name
        indexes = ["ticket_id", "status"]


# Usage (inside an async function)
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")
ticket.status = TicketStatus.RESOLVED
await ticket.save()
```
### 3. **Auto-Remediation Decision Flow**

**Multi-layered safety system** that decides whether AI can execute infrastructure changes.

**Flow** (`src/datacenter_docs/api/reliability.py` → `auto_remediation.py`):

```
Ticket Created
    ↓
ReliabilityCalculator.calculate_reliability()
    ├─ AI Confidence Score (25%)
    ├─ Human Feedback History (30%)
    ├─ Historical Success Rate (25%)
    └─ Pattern Matching (20%)
    ↓
Overall Reliability Score (0-100%)
    ↓
AutoRemediationDecisionEngine.should_execute()
    ├─ Check if enabled for ticket
    ├─ Check minimum reliability (85%)
    ├─ Check action risk level
    ├─ Check rate limits
    └─ Determine if approval needed
    ↓
AutoRemediationEngine.execute_remediation()
    ├─ Pre-execution checks
    ├─ Execute via MCP Client
    ├─ Post-execution validation
    └─ Log everything
```

**Key Classes**:

- `ReliabilityCalculator`: Calculates the weighted reliability score
- `AutoRemediationDecisionEngine`: Decides if/how to execute
- `AutoRemediationEngine`: Actually executes actions via MCP
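The weighted score in the diagram can be sketched as follows. The weights (25/30/25/20) and the 85% threshold come from the flow above; the dataclass and function names here are illustrative, not the actual `ReliabilityCalculator` API:

```python
from dataclasses import dataclass


@dataclass
class ReliabilitySignals:
    """Illustrative inputs, each normalized to the range 0-1."""

    ai_confidence: float
    human_feedback: float
    historical_success: float
    pattern_match: float


def calculate_reliability(s: ReliabilitySignals) -> float:
    """Combine the four signals with the documented weights; returns 0-100."""
    score = (
        0.25 * s.ai_confidence
        + 0.30 * s.human_feedback
        + 0.25 * s.historical_success
        + 0.20 * s.pattern_match
    )
    return round(score * 100, 1)


# A score at or above the 85% minimum may qualify for auto-execution
print(calculate_reliability(ReliabilitySignals(0.9, 0.9, 0.85, 0.8)))
```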
### 4. **MCP Client Integration**

**Location**: `src/datacenter_docs/mcp/client.py`

MCP (Model Context Protocol) is the bridge to infrastructure. It is an external service that connects to VMware, Kubernetes, network devices, etc.

**Important**: The MCP Client is EXTERNAL. We don't implement the infrastructure connections - we call MCP's API.

**Operations**:

- Read operations: Get VM status, list pods, check network config
- Write operations (auto-remediation): Restart VM, scale deployment, enable port
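Since the actual method names live in `mcp/client.py`, the read/write split can only be shown with a hypothetical stand-in - `FakeMCPClient` and the operation strings below are invented for illustration:

```python
import asyncio


class FakeMCPClient:
    """Hypothetical stand-in; the real client calls the external MCP service."""

    async def call(self, operation: str, **params: str) -> dict:
        # Echo the request back, as if the MCP service acknowledged it
        return {"operation": operation, "params": params, "status": "ok"}


async def main() -> tuple[dict, dict]:
    mcp = FakeMCPClient()
    status = await mcp.call("vmware.get_vm_status", vm="web-01")  # read
    restart = await mcp.call("vmware.restart_vm", vm="web-01")  # write (auto-remediation)
    return status, restart


read_result, write_result = asyncio.run(main())
print(read_result["status"], write_result["status"])  # ok ok
```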
### 5. **Documentation Agent (Agentic AI)**

**Location**: `src/datacenter_docs/chat/agent.py`

**Architecture Pattern**: RAG (Retrieval-Augmented Generation)

```
User Query
    ↓
Vector Search (ChromaDB + HuggingFace embeddings)
    ↓
Retrieve Top-K Relevant Docs
    ↓
Build Context + Query → LLM
    ↓
Generate Response with Citations
```

**Key Methods**:

- `search_documentation()`: Semantic search in the vector store
- `resolve_ticket()`: Analyze the problem + suggest a resolution
- `chat_with_context()`: Conversational interface with doc search
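The RAG shape can be shown without the real dependencies - here bag-of-words cosine similarity stands in for ChromaDB + HuggingFace embeddings, and the function names only mirror the agent's, so treat this as a toy model of the pipeline:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words term counts
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search_documentation(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank every doc by similarity to the query, keep the top k
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


docs = [
    "ESXi host maintenance mode procedure",
    "Restarting a stuck VM via the vCenter console",
    "Datastore capacity alert runbook",
]
top = search_documentation("restarting a frozen vm", docs)
print(build_prompt("restarting a frozen vm", top))
```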
### 6. **Missing Critical Components** (TODO)

**See `TODO.md` for the comprehensive list**. When implementing new features, check TODO.md first.

**High-Priority Missing Components**:

1. **CLI Tool** (`src/datacenter_docs/cli.py`):
   - Entry point: `datacenter-docs` command
   - Uses Typer + Rich for the CLI
   - Commands: generate, serve, worker, init-db, stats

2. **Celery Workers** (`src/datacenter_docs/workers/`):
   - `celery_app.py`: Celery configuration
   - `tasks.py`: Async tasks (documentation generation, auto-remediation execution)
   - Background task processing

3. **Collectors** (`src/datacenter_docs/collectors/`):
   - Base class exists, implementations missing
   - Need: VMware, Kubernetes, Network, Storage collectors
   - Pattern: `async def collect() -> dict`

4. **Generators** (`src/datacenter_docs/generators/`):
   - Base class exists, implementations missing
   - Need: Infrastructure, Network, Virtualization generators
   - Pattern: `async def generate(data: dict) -> str` (returns Markdown)

**When implementing these**:

- Follow the existing patterns in the base classes
- Use `LLMClient` for AI generation
- Use `MCPClient` for infrastructure data collection
- All operations are async
- Use MongoDB/Beanie for storage
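The collector and generator patterns can be sketched together. This is a guess at the base-class workflow, not the actual code: the real base classes may differ, and a real generator would call `LLMClient` (and a real collector `MCPClient`) instead of the fakes below:

```python
import asyncio
from abc import ABC, abstractmethod


class BaseCollector(ABC):
    """Hypothetical workflow wrapper; the actual base class lives in collectors/."""

    async def run(self) -> dict:
        await self.connect()
        try:
            data = await self.collect()
            self.validate(data)
            return data
        finally:
            await self.disconnect()

    async def connect(self) -> None: ...
    async def disconnect(self) -> None: ...

    def validate(self, data: dict) -> None:
        if not data:
            raise ValueError("collector returned no data")

    @abstractmethod
    async def collect(self) -> dict: ...


class FakeVMwareCollector(BaseCollector):
    async def collect(self) -> dict:
        # A real collector would query infrastructure through MCPClient
        return {"vms": [{"name": "web-01", "power": "on"}]}


async def generate(data: dict) -> str:
    # Generator pattern: dict in, Markdown out
    lines = ["# Virtual Machines", ""]
    lines += [f"- {vm['name']} (power: {vm['power']})" for vm in data["vms"]]
    return "\n".join(lines)


data = asyncio.run(FakeVMwareCollector().run())
markdown = asyncio.run(generate(data))
print(markdown)
```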
---

## Code Patterns & Conventions

### Async/Await

All operations use asyncio:

```python
async def my_function():
    result = await some_async_call()
```
### Type Hints

Type hints are required (mypy is configured strictly):

```python
async def process_ticket(ticket_id: str) -> Dict[str, Any]:
    ...
```

### Logging

Use structured logging with a module-level logger:

```python
import logging

logger = logging.getLogger(__name__)

logger.info(f"Processing ticket {ticket_id}")
logger.error(f"Failed to execute action: {e}", exc_info=True)
```
### Configuration

All config goes through `src/datacenter_docs/utils/config.py` using Pydantic Settings:

```python
from datacenter_docs.utils.config import get_settings

settings = get_settings()
mongodb_url = settings.MONGODB_URL
llm_model = settings.LLM_MODEL
```
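Conceptually, `get_settings()` reads environment variables once and caches the result. A stdlib-only approximation (field names taken from the usage above; the defaults are invented for illustration, and the real class uses Pydantic Settings):

```python
import os
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    # Defaults are illustrative; env vars override them
    MONGODB_URL: str = field(
        default_factory=lambda: os.environ.get("MONGODB_URL", "mongodb://localhost:27017")
    )
    LLM_MODEL: str = field(
        default_factory=lambda: os.environ.get("LLM_MODEL", "gpt-4-turbo-preview")
    )


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    # Cached so every caller shares the same instance
    return Settings()


settings = get_settings()
print(settings.MONGODB_URL)
```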
### Error Handling

```python
try:
    result = await risky_operation()
except SpecificException as e:
    logger.error(f"Operation failed: {e}", exc_info=True)
    return {"success": False, "error": str(e)}
```

---
## Docker Development Workflow

**Primary development environment**: Docker Compose

**Services in `deploy/docker/docker-compose.dev.yml`**:

- `mongodb`: MongoDB 7 (port 27017)
- `redis`: Redis 7 (port 6379)
- `api`: FastAPI service (port 8000)
- `chat`: WebSocket chat server (port 8001) - **NOT IMPLEMENTED**
- `worker`: Celery worker - **NOT IMPLEMENTED**
- `frontend`: React + Nginx (port 80) - **MINIMAL**

**Development cycle**:

1. Edit code in `src/`
2. Rebuild and restart the affected service: `docker-compose -f docker-compose.dev.yml up --build -d api`
3. Check logs: `docker-compose -f docker-compose.dev.yml logs -f api`
4. Test: access http://localhost:8000/api/docs

**Volume mounts**: Source code is mounted, so changes are reflected immediately (except dependency changes, which need a rebuild).

---
## CI/CD Pipelines

**Three CI/CD systems are configured** (all use Python 3.12):

- `.github/workflows/build-deploy.yml`: GitHub Actions
- `.gitlab-ci.yml`: GitLab CI
- `.gitea/workflows/ci.yml`: Gitea Actions

**Pipeline stages**:

1. Lint (Black, Ruff)
2. Type check (mypy)
3. Test (pytest)
4. Build Docker image
5. Deploy (if on the main branch)

**When modifying the Python version**: Update ALL three pipeline files.

---
## Key Files Reference

**Core Application**:

- `src/datacenter_docs/api/main.py`: FastAPI application entry point
- `src/datacenter_docs/api/models.py`: MongoDB/Beanie models (all data structures)
- `src/datacenter_docs/utils/config.py`: Configuration management
- `src/datacenter_docs/utils/llm_client.py`: LLM provider abstraction

**Auto-Remediation**:

- `src/datacenter_docs/api/reliability.py`: Reliability scoring and decision engine
- `src/datacenter_docs/api/auto_remediation.py`: Execution engine with safety checks

**Infrastructure Integration**:

- `src/datacenter_docs/mcp/client.py`: MCP protocol client
- `src/datacenter_docs/chat/agent.py`: Documentation AI agent (RAG)

**Configuration**:

- `.env.example`: Template with ALL config options (including LLM provider examples)
- `pyproject.toml`: Dependencies, scripts, linting config (Black 100 char, Python 3.12)

**Documentation**:

- `README.md`: User-facing documentation
- `TODO.md`: **CRITICAL** - current project status, missing components, roadmap
- `deploy/docker/README.md`: Docker environment guide

---
## Important Notes

### Python Version

Use Python 3.12 (standardized across the project).

### Database Queries

MongoDB queries look different from SQL:

```python
from datetime import datetime, timedelta

# Find
tickets = await Ticket.find(Ticket.status == TicketStatus.PENDING).to_list()

# Find one
ticket = await Ticket.find_one(Ticket.ticket_id == "INC-123")

# Update
ticket.status = TicketStatus.RESOLVED
await ticket.save()

# Complex query
tickets = await Ticket.find(
    Ticket.created_at > datetime.now() - timedelta(days=7),
    Ticket.category == "network",
).to_list()
```
### LLM API Calls

Use the generic client:

```python
from datacenter_docs.utils.llm_client import get_llm_client

llm = get_llm_client()
response = await llm.chat_completion(messages=[...])
```
### Auto-Remediation Safety

When implementing new remediation actions:

1. Define the action in the `RemediationAction` model
2. Set an appropriate `ActionRiskLevel` (low/medium/high/critical)
3. Implement pre/post validation checks
4. Add comprehensive logging
5. Test with `dry_run=True` first
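The risk-level and dry-run checks can be combined into gating logic like this sketch. The 85% threshold comes from the decision flow documented earlier; the function is illustrative and does not mirror the real decision engine's signature:

```python
from enum import Enum


class ActionRiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


MIN_RELIABILITY = 85.0  # minimum reliability from the decision flow


def should_execute(
    reliability: float,
    risk: ActionRiskLevel,
    enabled: bool,
    dry_run: bool = True,
) -> tuple[bool, str]:
    """Return (allowed, reason); illustrative, not the actual decision engine."""
    if not enabled:
        return False, "auto-remediation disabled for this ticket"
    if reliability < MIN_RELIABILITY:
        return False, f"reliability {reliability}% below {MIN_RELIABILITY}% threshold"
    if risk in (ActionRiskLevel.HIGH, ActionRiskLevel.CRITICAL):
        return False, "high-risk action requires human approval"
    if dry_run:
        return True, "dry run: action validated but not executed"
    return True, "execute"


print(should_execute(92.0, ActionRiskLevel.LOW, enabled=True))
```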
### Testing

Tests are currently minimal. When adding tests:

- Use `pytest-asyncio` for async tests
- Mock the MCP client and LLM client
- Test reliability calculations thoroughly
- Test safety checks in auto-remediation
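Mocking the LLM client can look like this stdlib-only sketch. `resolve_ticket` here is a simplified stand-in for the agent's method, not the real implementation; in the project you would wrap the body in a `pytest-asyncio` test function instead of `asyncio.run`:

```python
import asyncio
from unittest.mock import AsyncMock


async def resolve_ticket(llm, description: str) -> dict:
    # Simplified stand-in for the agent's resolve_ticket()
    answer = await llm.chat_completion(
        messages=[{"role": "user", "content": description}]
    )
    return {"success": True, "resolution": answer}


async def run_test() -> dict:
    llm = AsyncMock()
    llm.chat_completion.return_value = "Restart VM web-01 from vCenter"
    result = await resolve_ticket(llm, "VM web-01 is unresponsive")
    llm.chat_completion.assert_awaited_once()  # no real LLM call was made
    return result


print(asyncio.run(run_test()))
```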
---

## When Implementing New Features

1. Check `TODO.md` first - the component might be partially implemented
2. Follow existing patterns in similar components
3. Use type hints (mypy is strict)
4. Use `LLMClient` for AI operations
5. Use the Beanie ODM for database operations
6. All operations are async (use async/await)
7. Test in Docker (the primary development environment)
8. Update `TODO.md` when marking components as completed
## Questions? Check These Files

- **"How do I configure the LLM provider?"** → `.env.example`, `utils/config.py`, `utils/llm_client.py`
- **"How does auto-remediation work?"** → `api/reliability.py`, `api/auto_remediation.py`
- **"What's not implemented yet?"** → `TODO.md` (comprehensive list with estimates)
- **"How do I run tests/lint?"** → `pyproject.toml` (all commands), this file
- **"Database schema?"** → `api/models.py` (all Beanie models)
- **"Docker services?"** → `deploy/docker/docker-compose.dev.yml`, `deploy/docker/README.md`
- **"API endpoints?"** → `api/main.py`, or http://localhost:8000/api/docs when running

---

**Last Updated**: 2025-10-19
**Project Status**: 35% complete (infrastructure done, business logic pending)
**Next Priority**: CLI tool → Celery workers → Collectors → Generators