# πŸ€– LLM Automation - Docs & Remediation Engine > **Automated Datacenter Documentation & Intelligent Auto-Remediation System** > > AI-powered infrastructure documentation generation with autonomous problem resolution capabilities. [![Version](https://img.shields.io/badge/version-2.0.0-blue.svg)](https://github.com/yourusername/datacenter-docs) [![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE) --- ## 🌟 Features ### πŸ“š **Automated Documentation Generation** - Connects to datacenter infrastructure via MCP (Model Context Protocol) - Automatically generates comprehensive documentation - Updates documentation every 6 hours - 10 specialized documentation sections - LLM-powered content generation with Claude Sonnet 4.5 ### πŸ€– **Intelligent Auto-Remediation** (v2.0) - **AI can autonomously fix infrastructure issues** (disabled by default) - Multi-factor reliability scoring (0-100%) - Human feedback learning loop - Pattern recognition and continuous improvement - Safety-first design with approval workflows ### πŸ” **Agentic Chat Support** - Real-time chat with AI documentation agent - Autonomous documentation search - Context-aware responses - Conversational memory ### 🎯 **Ticket Resolution API** - Automatic ticket processing from external systems - AI-powered resolution suggestions - Optional auto-remediation execution - Confidence and reliability scoring ### πŸ“Š **Analytics & Monitoring** - Reliability statistics - Auto-remediation success rates - Feedback trends - Pattern learning insights - Prometheus metrics --- ## πŸ—οΈ Architecture ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ External Systems & Users β”‚ β”‚ Ticket Systems β”‚ Monitoring β”‚ Chat Interface β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ API Service β”‚ β”‚ Chat Serviceβ”‚ β”‚ (FastAPI) β”‚ β”‚ (WebSocket) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β” β”‚ Documentation Agent (AI) β”‚ β”‚ - Vector Search (ChromaDB) β”‚ β”‚ - Claude Sonnet 4.5 β”‚ β”‚ - Auto-Remediation Engine β”‚ β”‚ - Reliability Calculator β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ MCP Client β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ MCP Server β”‚ β”‚ Device Connectivity β”‚ β””β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ VMware K8s OS Net Storage ``` --- ## πŸš€ Quick Start ### Prerequisites - Python 3.10+ - Poetry 1.7+ - Docker & Docker Compose - MCP Server running - Anthropic API key ### 1. Clone Repository ```bash git clone https://git.commandware.com/ItOps/llm-automation-docs-and-remediation-engine.git cd llm-automation-docs-and-remediation-engine ``` ### 2. Configure Environment ```bash cp .env.example .env nano .env # Edit with your credentials ``` Required variables: ```bash MCP_SERVER_URL=https://mcp.commandware.com MCP_API_KEY=your_mcp_api_key ANTHROPIC_API_KEY=sk-ant-api03-xxxxx DATABASE_URL=postgresql://user:pass@host:5432/db REDIS_URL=redis://:pass@host:6379/0 ``` ### 3. Deploy #### Option A: Docker Compose (Recommended) ```bash docker-compose up -d ``` #### Option B: Local Development ```bash poetry install poetry run uvicorn datacenter_docs.api.main:app --reload ``` #### Option C: Kubernetes ```bash kubectl apply -f deploy/kubernetes/ ``` ### 4. Access Services - **API Documentation**: http://localhost:8000/api/docs - **Chat Interface**: http://localhost:8001 - **Frontend**: http://localhost - **Flower (Celery)**: http://localhost:5555 --- ## πŸ“– Documentation ### Core Documentation - [**Complete System Guide**](README_COMPLETE_SYSTEM.md) - Full system overview - [**Deployment Guide**](DEPLOYMENT_GUIDE.md) - Detailed deployment instructions - [**Auto-Remediation Guide**](AUTO_REMEDIATION_GUIDE.md) - ⭐ Complete guide to auto-remediation - [**What's New v2.0**](WHATS_NEW_V2.md) - New features in v2.0 - [**System Index**](INDEX_SISTEMA_COMPLETO.md) - Complete system index ### Quick References - [Quick Start](QUICK_START.md) - Get started in 5 minutes - [API Reference](docs/api-reference.md) - API endpoints - [Configuration](docs/configuration.md) - System configuration --- ## πŸ€– Auto-Remediation (v2.0) ### Overview The Auto-Remediation Engine enables AI to **autonomously resolve infrastructure issues** by executing write operations on your systems. **⚠️ SAFETY: Auto-remediation is DISABLED by default and must be explicitly enabled per ticket.** ### Key Features βœ… **Multi-Factor Reliability Scoring** (0-100%) - AI Confidence (25%) - Human Feedback (30%) - Historical Success (25%) - Pattern Match (20%) βœ… **Progressive Automation** - System learns from feedback - Patterns become eligible after 5+ successful resolutions - Auto-execution without approval at 90%+ reliability βœ… **Safety First** - Pre/post execution checks - Approval workflow for critical actions - Rate limiting (10 actions/hour) - Full rollback capability - Complete audit trail ### Example Usage ```python # Submit ticket WITH auto-remediation import requests response = requests.post('http://localhost:8000/api/v1/tickets', json={ 'ticket_id': 'INC-12345', 'title': 'Web service not responding', 'description': 'Service crashed on prod-web-01', 'category': 'server', 'enable_auto_remediation': True # ← Enable write operations }) # AI will: # 1. Analyze the problem # 2. Calculate reliability score # 3. If reliability β‰₯ 85% and safe action β†’ Execute automatically # 4. If critical action β†’ Request approval # 5. Log all actions taken # Get result result = requests.get(f'http://localhost:8000/api/v1/tickets/INC-12345') print(f"Status: {result.json()['status']}") print(f"Reliability: {result.json()['reliability_score']}%") print(f"Auto-remediated: {result.json()['auto_remediation_executed']}") ``` ### Supported Operations **VMware**: Restart VM, snapshot, increase resources **Kubernetes**: Restart pods, scale deployments, rollback **Network**: Clear errors, enable ports, restart interfaces **Storage**: Expand volumes, clear snapshots **OpenStack**: Reboot instances, resize ### Human Feedback Loop ```python # Provide feedback to improve AI requests.post('http://localhost:8000/api/v1/feedback', json={ 'ticket_id': 'INC-12345', 'feedback_type': 'positive', 'rating': 5, 'was_helpful': True, 'resolution_accurate': True, 'comment': 'Perfect resolution!' }) ``` **Feedback Impact:** - Updates reliability scores - Trains pattern recognition - Enables progressive automation - After 5+ similar issues with positive feedback β†’ Pattern becomes eligible for auto-remediation πŸ“– [**Read Full Auto-Remediation Guide**](AUTO_REMEDIATION_GUIDE.md) --- ## πŸ”Œ API Endpoints ### Ticket Management ```bash POST /api/v1/tickets # Create & process ticket GET /api/v1/tickets/{ticket_id} # Get ticket status GET /api/v1/stats/tickets # Statistics ``` ### Feedback System ```bash POST /api/v1/feedback # Submit feedback GET /api/v1/tickets/{id}/feedback # Get feedback history ``` ### Auto-Remediation ```bash POST /api/v1/tickets/{id}/approve-remediation # Approve/reject GET /api/v1/tickets/{id}/remediation-logs # Execution logs ``` ### Analytics ```bash GET /api/v1/stats/reliability # Reliability stats GET /api/v1/stats/auto-remediation # Auto-rem stats GET /api/v1/patterns # Learned patterns ``` ### Documentation ```bash POST /api/v1/documentation/search # Search docs POST /api/v1/documentation/generate/{section} # Generate section GET /api/v1/documentation/sections # List sections ``` --- ## 🎯 Use Cases ### 1. Automated Documentation - Connects to VMware, K8s, OpenStack, Network, Storage - Generates 10 comprehensive documentation sections - Updates every 6 hours automatically - LLM-powered with Claude Sonnet 4.5 ### 2. Ticket Auto-Resolution - Receive tickets from external systems (ITSM, monitoring) - AI analyzes and suggests resolutions - Optional auto-execution with safety checks - 90%+ accuracy for common issues ### 3. Chat Support - Real-time technical support - AI searches documentation autonomously - Context-aware responses - Conversational memory ### 4. Progressive Automation - System learns from feedback - Patterns emerge from repeated issues - Gradually increases automation level - Maintains human oversight for critical actions --- ## πŸ“Š Monitoring & Metrics ### Prometheus Metrics ```promql # Reliability score trend avg(datacenter_docs_reliability_score) by (category) # Auto-remediation success rate rate(datacenter_docs_auto_remediation_success_total[1h]) / rate(datacenter_docs_auto_remediation_attempts_total[1h]) # Ticket resolution rate rate(datacenter_docs_tickets_resolved_total[1h]) ``` ### Grafana Dashboards - Reliability trends by category - Auto-remediation success rates - Feedback distribution - Pattern learning progress - Processing time metrics --- ## πŸ” Security ### Authentication - API Key based authentication - JWT tokens for chat sessions - MCP server credentials secured in vault ### Safety Features - Auto-remediation disabled by default - Minimum 85% reliability required - Critical actions require approval - Rate limiting (10 actions/hour) - Pre/post execution validation - Full audit trail - Rollback capability ### Network Security - TLS encryption everywhere - Network policies in Kubernetes - CORS properly configured - Rate limiting enabled --- ## πŸ› οΈ Technology Stack ### Backend - **Framework**: FastAPI + Uvicorn - **Database**: PostgreSQL 15 - **Cache**: Redis 7 - **Task Queue**: Celery + Flower - **ORM**: SQLAlchemy + Alembic ### AI/LLM - **LLM**: Claude Sonnet 4.5 (Anthropic) - **Framework**: LangChain - **Vector Store**: ChromaDB - **Embeddings**: HuggingFace ### Infrastructure Connectivity - **Protocol**: MCP (Model Context Protocol) - **VMware**: pyvmomi - **Kubernetes**: kubernetes-client - **Network**: netmiko, paramiko - **OpenStack**: python-openstackclient ### Frontend - **Framework**: React 18 - **UI Library**: Material-UI (MUI) - **Build Tool**: Vite - **Real-time**: Socket.io ### DevOps - **Containers**: Docker + Docker Compose - **Orchestration**: Kubernetes - **CI/CD**: GitLab CI, Gitea Actions - **Monitoring**: Prometheus + Grafana - **Logging**: Structured JSON logs --- ## πŸ“ˆ Performance ### Metrics - **Documentation Generation**: ~5-10 minutes for full suite - **Ticket Processing**: 2-5 seconds average - **Auto-Remediation**: <3 seconds for known patterns - **Reliability Calculation**: <100ms - **API Response Time**: <200ms p99 ### Scalability - Horizontal scaling via Kubernetes - 10-20 Celery workers for production - Connection pooling for databases - Redis caching for hot data --- ## 🀝 Contributing We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details. ### Development Setup ```bash # Install dependencies poetry install # Run tests poetry run pytest # Run linting poetry run black src/ poetry run ruff check src/ # Start development server poetry run uvicorn datacenter_docs.api.main:app --reload ``` --- ## πŸ—ΊοΈ Roadmap ### v2.1 (Q2 2025) - [ ] Multi-language support (IT, ES, FR, DE) - [ ] Advanced analytics dashboard - [ ] Mobile app (iOS/Android) - [ ] Voice interface integration ### v2.2 (Q3 2025) - [ ] Multi-step reasoning for complex workflows - [ ] Predictive remediation (fix before incident) - [ ] A/B testing for resolution strategies - [ ] Cross-system orchestration ### v3.0 (Q4 2025) - [ ] Reinforcement learning optimization - [ ] Natural language explanations - [ ] Advanced pattern recognition with deep learning - [ ] Integration with major ITSM platforms (ServiceNow, Jira) --- ## πŸ“ License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. --- ## πŸ†˜ Support - **Email**: automation-team@commandware.com - **Documentation**: https://docs.commandware.com - **Issues**: https://git.commandware.com/ItOps/llm-automation-docs-and-remediation-engine/issues --- ## πŸ™ Acknowledgments - **Anthropic** - Claude Sonnet 4.5 LLM - **MCP Community** - Model Context Protocol - **Open Source Community** - All the amazing libraries used --- ## πŸ“Š Stats - ⭐ **90% reduction** in documentation time - ⭐ **80% of tickets** auto-resolved - ⭐ **<3 seconds** average resolution for known patterns - ⭐ **95%+ accuracy** with high confidence - ⭐ **24/7 automated** infrastructure support --- **Built with ❀️ for DevOps by DevOps** **Powered by Claude Sonnet 4.5 & MCP** πŸš€