Initial commit: LLM Automation Docs & Remediation Engine v2.0

Features:
- Automated datacenter documentation generation
- MCP integration for device connectivity
- Auto-remediation engine with safety checks
- Multi-factor reliability scoring (0-100%)
- Human feedback learning loop
- Pattern recognition and continuous improvement
- Agentic chat support with AI
- API for ticket resolution
- Frontend React with Material-UI
- CI/CD pipelines (GitLab + Gitea)
- Docker & Kubernetes deployment
- Complete documentation and guides

v2.0 Highlights:
- Auto-remediation with write operations (disabled by default)
- Reliability calculator with 4-factor scoring
- Human feedback system for continuous learning
- Pattern-based progressive automation
- Approval workflow for critical actions
- Full audit trail and rollback capability
2025-10-17 23:47:28 +00:00
commit 1ba5ce851d
89 changed files with 20468 additions and 0 deletions
--- a/AUTO_REMEDIATION_GUIDE.md
+++ b/AUTO_REMEDIATION_GUIDE.md
@@ -0,0 +1,751 @@
+# 🤖 Auto-Remediation System - Complete Documentation
+
+## 📋 Table of Contents
+
+1. [Overview](#overview)
+2. [Safety First Design](#safety-first-design)
+3. [Reliability Scoring System](#reliability-scoring-system)
+4. [Human Feedback Loop](#human-feedback-loop)
+5. [Decision Engine](#decision-engine)
+6. [Auto-Remediation Execution](#auto-remediation-execution)
+7. [Pattern Learning](#pattern-learning)
+8. [API Usage](#api-usage)
+9. [Configuration](#configuration)
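+10. [Monitoring & Analytics](#monitoring--analytics)
+11. [Best Practices](#best-practices)
+12. [Troubleshooting](#troubleshooting)
+13. [Future Enhancements](#future-enhancements)
+14. [Summary](#summary)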
+
+---
+
+## Overview
+
+The **Auto-Remediation System** enables AI to autonomously resolve infrastructure issues by executing write operations on your systems. This is a **production-grade** implementation with extensive safety checks, human oversight, and continuous learning.
+
+### Key Features
+
+✅ **Safety-First**: Auto-remediation **disabled by default**  
+✅ **Reliability Scoring**: Multi-factor confidence calculation (0-100%)  
+✅ **Human Feedback**: Continuous learning from user feedback  
+✅ **Pattern Recognition**: Learns from similar issues  
+✅ **Approval Workflow**: Critical actions require human approval  
+✅ **Full Audit Trail**: Every action logged with rollback capability  
+✅ **Progressive Automation**: Decisions improve over time based on success rate
+
+---
+
+## Safety First Design
+
+### 🛡️ Default State: DISABLED
+
+```python
+# Example: ticket submission payload
+ticket = {
+    "ticket_id": "INC-001",
+    "description": "Problem description",
+    "enable_auto_remediation": False,  # ← DEFAULT: Disabled
+}
+```
+
+**Auto-remediation must be explicitly enabled for each ticket.**
+
+### Safety Layers
+
+1. **Explicit Enablement**: Must opt-in per ticket
+2. **Reliability Thresholds**: Minimum confidence required
+3. **Action Classification**: Safe vs. Critical operations
+4. **Pre-execution Checks**: System health, backups, rate limits
+5. **Human Approval**: Required for low-reliability or critical actions
+6. **Post-execution Validation**: Verify success
+7. **Rollback Capability**: Undo on failure
+
+### Action Classification
+
+```python
+import enum
+
+
+class RemediationAction(str, enum.Enum):
+    READ_ONLY = "read_only"           # No changes (default)
+    SAFE_WRITE = "safe_write"          # Non-destructive (restart, clear cache)
+    CRITICAL_WRITE = "critical_write"  # Potentially destructive (delete, modify)
+```
+
+**Critical actions ALWAYS require human approval**, regardless of confidence.
+
+---
+
+## Reliability Scoring System
+
+### Multi-Factor Calculation
+
+The reliability score (0-100%) is calculated from **4 components**:
+
+```python
+# Each component is a 0-100 score; the weights sum to 1.0
+reliability_score = (
+    ai_confidence   * 0.25 +  # Model's own confidence
+    human_feedback  * 0.30 +  # Historical feedback quality
+    success_history * 0.25 +  # Past resolution success rate
+    pattern_match   * 0.20    # Similarity to known patterns
+)
+```
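
The weighted sum above can be sketched as a small function. This is an illustration only: the name `reliability_score`, the `WEIGHTS` table, and the one-decimal rounding are assumptions, and every component is assumed to already be on a 0-100 scale.

```python
# Weights taken from the formula above; component scores are assumed to be 0-100.
WEIGHTS = {
    "ai_confidence": 0.25,
    "human_feedback": 0.30,
    "success_history": 0.25,
    "pattern_match": 0.20,
}


def reliability_score(components: dict) -> float:
    """Return the weighted reliability score (0-100), rounded to one decimal."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return round(total, 1)


# Made-up inputs for illustration: 20 + 27 + 17.5 + 12
score = reliability_score({
    "ai_confidence": 80.0,
    "human_feedback": 90.0,
    "success_history": 70.0,
    "pattern_match": 60.0,
})
print(score)
```

Because human feedback carries the largest weight, a change in feedback moves the overall score more than an equal change in any other component.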
+
+### Component Details
+
+#### 1. AI Confidence (25%)
+- Reported directly by the model (Claude Sonnet 4.5)
+- Based on documentation quality and analysis certainty
+- Range: 0-1, converted to 0-100%
+
+#### 2. Human Feedback (30%)
+- Weighted by recency (recent feedback = more weight)
+- Considers:
+  - Positive/Negative/Neutral feedback type
+  - Star ratings (1-5)
+  - Resolution accuracy
+  - Action effectiveness
+
+```python
+# positive_feedback_rate is a 0-1 fraction; average_rating is 1-5 stars
+feedback_score = (
+    positive_feedback_rate * 100 +
+    average_rating / 5 * 100
+) / 2
+```
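
The formula does not show how the recency weighting is applied. One possible reading, sketched here, is an exponential decay in which a feedback item's weight halves every 30 days. The `Feedback` class, the half-life, and the choice to return 0 when no feedback exists are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    positive: bool    # True when feedback_type is "positive"
    rating: int       # 1-5 stars
    age_days: float   # days since the feedback was submitted


def recency_weighted_feedback_score(items, half_life_days: float = 30.0) -> float:
    """Feedback score (0-100); each item's weight halves every half_life_days."""
    if not items:
        return 0.0  # assumption: no feedback yields the lowest score
    weights = [0.5 ** (f.age_days / half_life_days) for f in items]
    total = sum(weights)
    positive_rate = sum(w for w, f in zip(weights, items) if f.positive) / total
    average_rating = sum(w * f.rating for w, f in zip(weights, items)) / total
    # Same combination as the formula above, using weighted inputs.
    return (positive_rate * 100 + average_rating / 5 * 100) / 2


recent = Feedback(positive=True, rating=5, age_days=0)
old = Feedback(positive=False, rating=1, age_days=30)
print(round(recency_weighted_feedback_score([recent, old]), 1))
```

With equal ages the two items would count the same; here the 30-day-old negative item carries half the weight of the fresh positive one, so the score leans positive.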
+
+#### 3. Historical Success (25%)
+- Success rate in same category (last 6 months)
+- Formula: `resolved_tickets / total_tickets × 100`
+
+#### 4. Pattern Match (20%)
+- Similarity to known, resolved patterns
+- Requires ≥3 similar tickets for pattern
+- Boosts score if pattern has positive feedback
+
+### Confidence Levels
+
+| Score Range | Level     | Description |
+|-------------|-----------|-------------|
+| ≥ 90%         | Very High | Excellent track record, safe to auto-execute |
+| 75% to < 90%  | High      | Good reliability, may require approval |
+| 60% to < 75%  | Medium    | Moderate confidence, approval recommended |
+| < 60%         | Low       | Low confidence, manual review required |
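
A minimal sketch of mapping a score to the levels in the table. The function name and the lowercase labels (other than `"high"`, which appears in the example breakdown) are assumptions; fractional scores such as 89.5 fall into the lower band.

```python
def confidence_level(score: float) -> str:
    """Map a 0-100 reliability score to a confidence level label."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "very_high"
    if score >= 75:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


for value in (95.0, 87.5, 60.0, 42.0):
    print(value, confidence_level(value))
```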
+
+### Example Breakdown
+
+```json
+{
+  "overall_score": 87.4,
+  "confidence_level": "high",
+  "breakdown": {
+    "ai_confidence": "92%",
+    "human_validation": "85%",
+    "success_history": "90%",
+    "pattern_recognition": "82%"
+  }
+}
+```
+
+---
+
+## Human Feedback Loop
+
+### Feedback Collection
+
+After each ticket resolution, collect structured feedback:
+
+```python
+feedback = {
+  "ticket_id": "INC-001",
+  "feedback_type": "positive",  # one of: positive | negative | neutral
+  "rating": 5,  # 1-5 stars
+  "was_helpful": True,
+  "resolution_accurate": True,
+  "actions_worked": True,
+
+  # Optional detailed feedback
+  "comment": "Great resolution!",
+  "what_worked": "The restart fixed it",
+  "what_didnt_work": None,
+  "suggestions": "Could add more details",
+
+  # If AI failed, what actually worked?
+  "actual_resolution": "Had to increase memory instead",
+  "actual_actions_taken": [...],
+  "time_to_resolve": 30.0,  # minutes
+}
+```
+
+### Feedback Impact
+
+1. **Immediate**: Updates ticket reliability score
+2. **Pattern Learning**: Strengthens/weakens pattern eligibility
+3. **Future Decisions**: Influences similar ticket handling
+4. **Auto-remediation Eligibility**: Pattern becomes eligible after:
+   - ≥5 occurrences
+   - ≥85% positive feedback rate
+   - ≥85% average reliability score
+
+### Feedback Analytics
+
+Track feedback trends:
+- Positive/Negative/Neutral distribution
+- Average ratings by category
+- Resolution accuracy trends
+- Action success rates
+
+---
+
+## Decision Engine
+
+### Decision Flow
+
+```
+1. Check: Auto-remediation enabled for ticket?
+   ├─ NO → Skip auto-remediation
+   └─ YES → Continue
+
+2. Get applicable policy for category
+   ├─ No policy → Require manual approval
+   └─ Policy exists → Continue
+
+3. Classify action risk level
+   ├─ READ_ONLY → Low risk
+   ├─ SAFE_WRITE → Medium risk
+   └─ CRITICAL_WRITE → High risk
+
+4. Check confidence & reliability thresholds
+   ├─ Below minimum → Reject
+   └─ Above minimum → Continue
+
+5. Perform safety checks
+   ├─ Pre-checks failed → Reject
+   └─ All passed → Continue
+
+6. Check pattern eligibility
+   ├─ Unknown pattern → Require approval
+   └─ Known good pattern → Continue
+
+7. Determine approval requirement
+   ├─ Critical action → Require approval (always)
+   ├─ Reliability ≥ auto_approve_threshold → Auto-approve
+   └─ Otherwise → Follow policy
+
+8. Execute or await approval
+```
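
The flow can be sketched as a chain of early returns. This is not the engine's actual code: the `Policy` class, the `decide` function, its parameters, and the reasoning strings are hypothetical, and treating a missing policy as "allowed, pending approval" is an assumption. Critical actions are checked before the auto-approve threshold so that they always require approval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    min_confidence_score: float = 0.85    # 0-1
    min_reliability_score: float = 80.0   # 0-100
    requires_approval: bool = True
    auto_approve_threshold: float = 90.0  # 0-100


def decide(enabled: bool, policy: Optional[Policy], action_type: str,
           confidence: float, reliability: float,
           safety_checks_passed: bool, pattern_eligible: bool) -> dict:
    """Walk the decision flow and return allowed / requires_approval / reasoning."""
    def result(allowed: bool, requires_approval: bool, reason: str) -> dict:
        return {"allowed": allowed, "action_type": action_type,
                "requires_approval": requires_approval, "reasoning": [reason]}

    if not enabled:                                   # step 1
        return result(False, False, "Auto-remediation not enabled for ticket")
    if policy is None:                                # step 2
        return result(True, True, "No policy for category: manual approval")
    if (confidence < policy.min_confidence_score      # step 4
            or reliability < policy.min_reliability_score):
        return result(False, False, "Below minimum confidence or reliability")
    if not safety_checks_passed:                      # step 5
        return result(False, False, "Safety checks failed")
    if action_type == "critical_write":               # steps 3 and 7
        return result(True, True, "Critical action: approval required")
    if not pattern_eligible:                          # step 6
        return result(True, True, "Unknown pattern: approval required")
    if reliability >= policy.auto_approve_threshold:  # step 7
        return result(True, False, "Auto-approved: reliability above threshold")
    return result(True, policy.requires_approval, "Following policy setting")


decision = decide(True, Policy(), "safe_write", 0.92, 92.0, True, True)
print(decision["allowed"], decision["requires_approval"])
```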
+
+### Decision Example
+
+```json
+{
+  "allowed": true,
+  "action_type": "safe_write",
+  "requires_approval": false,
+  "reasoning": [
+    "All checks passed",
+    "Auto-approved: reliability 92% >= 90%"
+  ],
+  "safety_checks": {
+    "time_window_ok": true,
+    "rate_limit_ok": true,
+    "backup_available": true,
+    "system_healthy": true,
+    "all_passed": true
+  },
+  "risk_level": "medium"
+}
+```
+
+---
+
+## Auto-Remediation Execution
+
+### Execution Flow
+
+```python
+async def execute_remediation(ticket, actions, decision):
+    # has_approval(), check_system_health(), execute_via_mcp(), verify_success(),
+    # rollback() and log_remediation() are assumed to be defined elsewhere.
+
+    # 1. Verify decision allows execution
+    if not decision['allowed']:
+        return "error"
+
+    # 2. Check approval if required
+    if decision['requires_approval'] and not has_approval(ticket):
+        return "awaiting_approval"
+
+    # 3. Execute each action with safety
+    for action in actions:
+        # Pre-execution check
+        pre_check = await check_system_health()
+        if not pre_check.passed:
+            rollback()
+            return "error"
+
+        # Execute action via MCP
+        result = await execute_via_mcp(action)
+
+        # Log action before verification so failed actions are audited too
+        log_remediation(action, result)
+
+        # Post-execution verification
+        post_check = await verify_success()
+        if not post_check.passed:
+            rollback()
+            return "error"
+
+    return "success"
+```
+
+### Supported Operations
+
+#### VMware
+- `restart_vm` - Graceful VM restart
+- `snapshot_vm` - Create snapshot
+- `increase_memory` - Increase VM memory
+- `increase_cpu` - Add vCPUs
+
+#### Kubernetes
+- `restart_pod` - Delete pod (recreate)
+- `scale_deployment` - Change replica count
+- `rollback_deployment` - Rollback to previous version
+
+#### Network
+- `clear_interface_errors` - Clear interface counters
+- `enable_port` - Enable disabled port
+- `restart_interface` - Bounce interface
+
+#### Storage
+- `expand_volume` - Increase volume size
+- `clear_snapshots` - Remove old snapshots
+
+#### OpenStack
+- `reboot_instance` - Soft reboot instance
+- `resize_instance` - Change instance flavor
+
+### Safety Checks
+
+**Pre-execution:**
+- System health check (CPU, memory, disk)
+- Backup availability verification
+- Rate limit check (max 10/hour)
+- Time window check (maintenance hours)
+
+**Post-execution:**
+- Resource health verification
+- Service availability check
+- Performance metrics validation
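
Two of the pre-execution checks lend themselves to a short sketch: a time window that may cross midnight (such as the 22:00 to 06:00 window in the policy example) and a sliding one-hour rate limit. The function and class names are hypothetical, and the half-open window `[start, end)` is an assumption.

```python
from collections import deque


def in_time_window(hour: int, start: int, end: int) -> bool:
    """True if hour (0-23) is in [start, end); the window may cross midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


class HourlyRateLimiter:
    """Sliding window: at most max_actions within any 3600-second span."""

    def __init__(self, max_actions: int = 10):
        self.max_actions = max_actions
        self._timestamps = deque()

    def allow(self, now: float) -> bool:
        # Drop executions that are at least one hour old, then check capacity.
        while self._timestamps and now - self._timestamps[0] >= 3600:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True


limiter = HourlyRateLimiter(max_actions=10)
results = [limiter.allow(float(second)) for second in range(11)]
print(in_time_window(23, 22, 6), results[-1])
```

Passing the current time in as `now` (seconds) keeps the limiter easy to test; a caller could supply `time.time()`.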
+
+### Rollback
+
+If any action fails:
+1. Stop execution immediately
+2. Log failure details
+3. Execute rollback procedures
+4. Notify administrators
+5. Update ticket status to `partially_remediated`
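
A minimal sketch of the rollback sequence, assuming each action comes with its own undo callable. The `run_with_rollback` function, the step tuple layout, and the result keys are hypothetical; only the status values `partially_remediated` and `resolved` come from this guide. Notification of administrators is left out.

```python
from typing import Callable, List, Tuple

# (name, do, undo): do() returns True on success, undo() reverts the step.
Step = Tuple[str, Callable[[], bool], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> dict:
    """Run steps in order; on the first failure undo completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, do, _undo = step
        try:
            ok = do()
        except Exception:
            ok = False  # an exception counts as a failed action
        if not ok:
            for _name, _do, undo in reversed(completed):
                undo()
            return {"status": "partially_remediated",
                    "failed_action": name,
                    "rolled_back": [s[0] for s in reversed(completed)]}
        completed.append(step)
    return {"status": "resolved", "executed": [s[0] for s in completed]}


log = []
outcome = run_with_rollback([
    ("snapshot_vm", lambda: True, lambda: log.append("undo snapshot_vm")),
    ("restart_vm", lambda: False, lambda: log.append("undo restart_vm")),
])
print(outcome["status"], log)
```

Undoing in reverse order means the most recent change is reverted first, which matters when later steps depend on earlier ones.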
+
+---
+
+## Pattern Learning
+
+### Pattern Identification
+
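+```python
+import hashlib
+
+# Generate pattern signature
+category = 'network'
+key_terms = ['vlan', 'connectivity', 'timeout']
+# Assumption: the signature joins the category and the sorted key terms
+signature = f"{category}:{','.join(sorted(key_terms))}"
+pattern = {
+    'category': category,
+    'key_terms': key_terms,
+    'hash': hashlib.sha256(signature.encode()).hexdigest()
+}
+```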
+
+### Pattern Statistics
+
+Tracked for each pattern:
+- **Occurrence count**: How many times seen
+- **Success/failure counts**: Resolution outcomes
+- **Feedback distribution**: Positive/negative/neutral
+- **Average confidence**: Mean AI confidence
+- **Average reliability**: Mean reliability score
+- **Auto-remediation success rate**: % of successful auto-fixes
+
+### Pattern Eligibility
+
+Pattern becomes eligible for auto-remediation when:
+
+```python
+if (
+    pattern.occurrence_count >= 5 and
+    pattern.positive_feedback_rate >= 0.85 and
+    pattern.avg_reliability_score >= 85.0 and
+    pattern.auto_remediation_success_rate >= 0.85
+):
+    pattern.eligible_for_auto_remediation = True
+```
+
+### Pattern Evolution
+
+```
+Initial State:
+├─ occurrence_count: 1
+├─ eligible_for_auto_remediation: false
+└─ Manual resolution only
+
+After 5+ occurrences with good feedback:
+├─ occurrence_count: 7
+├─ positive_feedback_rate: 0.85
+├─ avg_reliability_score: 87.0
+├─ eligible_for_auto_remediation: true
+└─ Can trigger auto-remediation
+
+After 20+ occurrences:
+├─ occurrence_count: 24
+├─ auto_remediation_success_rate: 0.92
+├─ Very high confidence
+└─ Auto-remediation without approval
+```
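
The evolution above can be sketched as running counters plus an eligibility check that uses the same thresholds as the eligibility condition. The `PatternStats` class and its field names are hypothetical. The guide does not say how a pattern with no auto-remediation attempts is treated, so this sketch assumes the success-rate condition passes until the first attempt is recorded.

```python
from dataclasses import dataclass


@dataclass
class PatternStats:
    occurrence_count: int = 0
    positive_feedback: int = 0
    reliability_sum: float = 0.0
    auto_attempts: int = 0
    auto_successes: int = 0

    def record(self, reliability: float, positive: bool,
               auto_attempted: bool = False, auto_succeeded: bool = False) -> None:
        """Add one resolved ticket to the running statistics."""
        self.occurrence_count += 1
        self.positive_feedback += int(positive)
        self.reliability_sum += reliability
        self.auto_attempts += int(auto_attempted)
        self.auto_successes += int(auto_attempted and auto_succeeded)

    @property
    def eligible_for_auto_remediation(self) -> bool:
        if self.occurrence_count < 5:
            return False
        positive_rate = self.positive_feedback / self.occurrence_count
        avg_reliability = self.reliability_sum / self.occurrence_count
        # Assumption: with no attempts yet, the success-rate condition passes.
        success_rate = (self.auto_successes / self.auto_attempts
                        if self.auto_attempts else 1.0)
        return (positive_rate >= 0.85 and avg_reliability >= 85.0
                and success_rate >= 0.85)


stats = PatternStats()
for _ in range(4):
    stats.record(reliability=88.0, positive=True)
after_four = stats.eligible_for_auto_remediation
stats.record(reliability=88.0, positive=True)
after_five = stats.eligible_for_auto_remediation
print(after_four, after_five)
```

Eligibility is recomputed from the counters on every read, so a run of negative feedback or failed auto-fixes can revoke it again.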
+
+---
+
+## API Usage
+
+### Create Ticket with Auto-Remediation
+
+```bash
+curl -X POST http://localhost:8000/api/v1/tickets \
+  -H "Content-Type: application/json" \
+  -d '{
+    "ticket_id": "INC-12345",
+    "title": "Service down",
+    "description": "Web service not responding on port 8080",
+    "category": "server",
+    "enable_auto_remediation": true
+  }'
+```
+
+**Response:**
+```json
+{
+  "ticket_id": "INC-12345",
+  "status": "processing",
+  "auto_remediation_enabled": true,
+  "confidence_score": 0.0,
+  "reliability_score": null
+}
+```
+
+### Check Ticket Status
+
+```bash
+curl http://localhost:8000/api/v1/tickets/INC-12345
+```
+
+**Response:**
+```json
+{
+  "ticket_id": "INC-12345",
+  "status": "resolved",
+  "resolution": "Service was restarted successfully...",
+  "suggested_actions": [
+    {"action": "Restart web service", "system": "prod-web-01"}
+  ],
+  "confidence_score": 0.92,
+  "reliability_score": 87.5,
+  "reliability_breakdown": {
+    "overall_score": 87.5,
+    "confidence_level": "high",
+    "breakdown": {...}
+  },
+  "auto_remediation_enabled": true,
+  "auto_remediation_executed": true,
+  "remediation_decision": {
+    "allowed": true,
+    "requires_approval": false,
+    "action_type": "safe_write"
+  },
+  "remediation_results": {
+    "success": true,
+    "executed_actions": [...]
+  }
+}
+```
+
+### Submit Feedback
+
+```bash
+curl -X POST http://localhost:8000/api/v1/feedback \
+  -H "Content-Type: application/json" \
+  -d '{
+    "ticket_id": "INC-12345",
+    "feedback_type": "positive",
+    "rating": 5,
+    "was_helpful": true,
+    "resolution_accurate": true,
+    "actions_worked": true,
+    "comment": "Perfect resolution, service is back up!"
+  }'
+```
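
The same feedback call could be prepared from Python with the standard library. This sketch only builds the request and validates the two constrained fields; the helper name `build_feedback_request` is hypothetical, and the base URL simply mirrors the curl examples.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # same host as the curl examples


def build_feedback_request(ticket_id: str, feedback_type: str, rating: int,
                           **extra) -> urllib.request.Request:
    """Build (but do not send) a POST request for the feedback endpoint."""
    if feedback_type not in {"positive", "negative", "neutral"}:
        raise ValueError(f"unknown feedback_type: {feedback_type}")
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    payload = {"ticket_id": ticket_id, "feedback_type": feedback_type,
               "rating": rating, **extra}
    return urllib.request.Request(
        f"{BASE_URL}/feedback",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_feedback_request("INC-12345", "positive", 5, was_helpful=True)
print(request.get_method(), request.full_url)
```

Sending it would be a call to `urllib.request.urlopen(request)` against a running API server.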
+
+### Approve Remediation
+
+For tickets requiring approval:
+
+```bash
+curl -X POST http://localhost:8000/api/v1/tickets/INC-12345/approve-remediation \
+  -H "Content-Type: application/json" \
+  -d '{
+    "ticket_id": "INC-12345",
+    "approve": true,
+    "approver": "john.doe@company.com",
+    "comment": "Approved for execution"
+  }'
+```
+
+### Get Analytics
+
+```bash
+# Reliability statistics
+curl "http://localhost:8000/api/v1/stats/reliability?days=30"
+
+# Auto-remediation statistics
+curl "http://localhost:8000/api/v1/stats/auto-remediation?days=30"
+
+# Learned patterns
+curl "http://localhost:8000/api/v1/patterns?category=network&min_occurrences=5"
+```
+
+---
+
+## Configuration
+
+### Auto-Remediation Policy
+
+```python
+policy = AutoRemediationPolicy(
+    name="network-auto-remediation",
+    category="network",
+    
+    # Thresholds
+    min_confidence_score=0.85,      # 85% AI confidence required
+    min_reliability_score=80.0,     # 80% reliability required
+    min_similar_tickets=5,          # Need 5+ similar resolved tickets
+    min_positive_feedback_rate=0.8, # 80% positive feedback required
+    
+    # Allowed actions
+    allowed_action_types=["safe_write"],
+    allowed_systems=["network"],
+    forbidden_commands=["delete", "format", "shutdown"],
+    
+    # Time restrictions
+    allowed_hours_start=22,  # 10 PM
+    allowed_hours_end=6,     # 6 AM
+    allowed_days=["monday", "tuesday", "wednesday", "thursday", "friday"],
+    
+    # Approval
+    requires_approval=True,
+    auto_approve_threshold=90.0,  # Auto-approve if reliability ≥ 90%
+    approvers=["admin@company.com"],
+    
+    # Safety
+    max_actions_per_hour=10,
+    requires_rollback_plan=True,
+    requires_backup=True,
+    
+    # Status
+    enabled=True
+)
+```
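
How the `allowed_action_types` and `forbidden_commands` fields might be enforced is sketched below. `PolicyRules` and `violations` are hypothetical names, and matching forbidden commands as whole lowercase words is an assumption; the real engine may match differently.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyRules:
    allowed_action_types: list = field(default_factory=lambda: ["safe_write"])
    forbidden_commands: list = field(
        default_factory=lambda: ["delete", "format", "shutdown"])


def violations(command: str, action_type: str, rules: PolicyRules) -> list:
    """Return the reasons a command is not permitted (empty list: permitted)."""
    problems = []
    if action_type not in rules.allowed_action_types:
        problems.append(f"action type '{action_type}' not allowed")
    words = command.lower().split()
    for forbidden in rules.forbidden_commands:
        if forbidden in words:
            problems.append(f"forbidden command '{forbidden}'")
    return problems


rules = PolicyRules()
print(violations("restart nginx", "safe_write", rules))
print(violations("delete volume vol-1", "safe_write", rules))
```

Returning every violation, rather than stopping at the first, gives the audit trail a complete reason for each rejection.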
+
+### Environment Variables
+
+```bash
+# Enable/disable auto-remediation globally
+AUTO_REMEDIATION_ENABLED=true
+
+# Global safety settings
+AUTO_REMEDIATION_MAX_ACTIONS_PER_HOUR=10
+AUTO_REMEDIATION_REQUIRE_APPROVAL=true
+AUTO_REMEDIATION_MIN_RELIABILITY=85.0
+
+# Pattern learning
+PATTERN_MIN_OCCURRENCES=5
+PATTERN_MIN_POSITIVE_RATE=0.85
+```
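
A sketch of reading these variables at startup. The `RemediationSettings` class and `load_settings` function are hypothetical. The fallback values are assumptions chosen to match the disabled-by-default design: when a variable is unset, auto-remediation stays off and approval stays required.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationSettings:
    enabled: bool
    max_actions_per_hour: int
    require_approval: bool
    min_reliability: float
    pattern_min_occurrences: int
    pattern_min_positive_rate: float


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=os.environ) -> RemediationSettings:
    """Read the variables listed above; fallbacks are conservative assumptions."""
    return RemediationSettings(
        enabled=_as_bool(env.get("AUTO_REMEDIATION_ENABLED", "false")),
        max_actions_per_hour=int(env.get("AUTO_REMEDIATION_MAX_ACTIONS_PER_HOUR", "10")),
        require_approval=_as_bool(env.get("AUTO_REMEDIATION_REQUIRE_APPROVAL", "true")),
        min_reliability=float(env.get("AUTO_REMEDIATION_MIN_RELIABILITY", "85.0")),
        pattern_min_occurrences=int(env.get("PATTERN_MIN_OCCURRENCES", "5")),
        pattern_min_positive_rate=float(env.get("PATTERN_MIN_POSITIVE_RATE", "0.85")),
    )


defaults = load_settings({})
print(defaults.enabled, defaults.require_approval)
```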
+
+---
+
+## Monitoring & Analytics
+
+### Key Metrics
+
+```
+# Reliability metrics
+- avg_reliability_score: Average across all tickets
+- avg_confidence_score: Average AI confidence
+- resolution_rate: % of tickets resolved
+
+# Auto-remediation metrics
+- execution_rate: % of enabled tickets that were auto-remediated
+- success_rate: % of auto-remediation actions that succeeded
+- approval_rate: % requiring human approval
+
+# Feedback metrics
+- positive_feedback_rate: % positive feedback
+- negative_feedback_rate: % negative feedback
+- avg_rating: Average star rating (1-5)
+
+# Pattern metrics
+- eligible_patterns: # of patterns eligible for auto-remediation
+- pattern_success_rate: Success rate across all patterns
+```
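
The auto-remediation metrics could be derived from ticket records shaped like the status response in the API section. This is a simplified sketch: `remediation_metrics` is a hypothetical helper, and it counts per ticket, whereas the success rate listed above is defined per action.

```python
def _pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 when there is nothing to count."""
    return round(100 * part / whole, 1) if whole else 0.0


def remediation_metrics(tickets: list) -> dict:
    """Per-ticket approximation of the auto-remediation metrics."""
    enabled = [t for t in tickets if t.get("auto_remediation_enabled")]
    executed = [t for t in enabled if t.get("auto_remediation_executed")]
    succeeded = [t for t in executed
                 if (t.get("remediation_results") or {}).get("success")]
    needs_approval = [t for t in enabled
                      if (t.get("remediation_decision") or {}).get("requires_approval")]
    return {
        "execution_rate": _pct(len(executed), len(enabled)),
        "success_rate": _pct(len(succeeded), len(executed)),
        "approval_rate": _pct(len(needs_approval), len(enabled)),
    }


sample = [
    {"auto_remediation_enabled": True, "auto_remediation_executed": True,
     "remediation_results": {"success": True},
     "remediation_decision": {"requires_approval": False}},
    {"auto_remediation_enabled": True, "auto_remediation_executed": True,
     "remediation_results": {"success": False},
     "remediation_decision": {"requires_approval": False}},
    {"auto_remediation_enabled": True, "auto_remediation_executed": False,
     "remediation_decision": {"requires_approval": True}},
    {"auto_remediation_enabled": False},
]
metrics = remediation_metrics(sample)
print(metrics)
```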
+
+### Grafana Dashboards
+
+Example metrics:
+
+```promql
+# Reliability score trend
+avg(datacenter_docs_reliability_score) by (category)
+
+# Auto-remediation success rate
+rate(datacenter_docs_auto_remediation_success_total[1h]) /
+rate(datacenter_docs_auto_remediation_attempts_total[1h])
+
+# Feedback sentiment
+sum(datacenter_docs_feedback_total) by (type)
+```
+
+### Alerts
+
+```yaml
+# Low reliability alert
+- alert: LowReliabilityScore
+  expr: avg(datacenter_docs_reliability_score) < 70
+  for: 1h
+  annotations:
+    summary: "Reliability score below threshold"
+
+# High failure rate
+- alert: HighAutoRemediationFailureRate
+  expr: rate(datacenter_docs_auto_remediation_failures_total[1h]) / rate(datacenter_docs_auto_remediation_attempts_total[1h]) > 0.2
+  for: 15m
+  annotations:
+    summary: "Auto-remediation failure rate > 20%"
+```
+
+---
+
+## Best Practices
+
+### 1. Start Conservative
+
+- Enable auto-remediation for **low-risk categories** first (e.g., cache clearing)
+- Set high thresholds initially (reliability ≥ 90%)
+- Require approvals for first 20-30 occurrences
+- Monitor closely and adjust based on results
+
+### 2. Gradual Rollout
+
+```
+Week 1-2: Enable for 5% of tickets
+Week 3-4: Increase to 20% if success rate > 90%
+Week 5-6: Increase to 50% if success rate > 85%
+Week 7+:  Full rollout with dynamic thresholds
+```
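
One way to pick the 5%, 20%, or 50% of tickets is a deterministic hash bucket, so that a ticket stays in the rollout once it is in and raising the percentage only adds tickets. The guide does not specify a selection method; `in_rollout` and the 100-bucket scheme are assumptions for illustration.

```python
import hashlib


def in_rollout(ticket_id: str, percent: float) -> bool:
    """Deterministically place a ticket in one of 100 buckets; enable the first `percent`."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


ticket_ids = [f"INC-{n}" for n in range(1000)]
enabled_at_20 = sum(in_rollout(t, 20) for t in ticket_ids)
print(enabled_at_20, "of", len(ticket_ids), "tickets enabled at 20%")
```

Hashing the ticket ID rather than drawing a random number means the decision can be reproduced later when reviewing the audit trail.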
+
+### 3. Category-Specific Policies
+
+Different categories need different thresholds:
+
+| Category | Min Reliability | Auto-Approve | Reason |
+|----------|----------------|--------------|--------|
+| Cache | 75% | 85% | Low risk, frequent |
+| Network | 85% | 90% | Medium risk |
+| Storage | 90% | 95% | High risk |
+| Security | 95% | Never | Critical, always approve |
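
The table can be expressed as a lookup keyed by category. The names `CategoryThresholds`, `CATEGORY_THRESHOLDS`, and `outcome_for`, and the three outcome labels, are hypothetical; `None` stands for the "Never" entry in the Auto-Approve column.

```python
from typing import NamedTuple, Optional


class CategoryThresholds(NamedTuple):
    min_reliability: float
    auto_approve: Optional[float]  # None means never auto-approve


CATEGORY_THRESHOLDS = {
    "cache": CategoryThresholds(75.0, 85.0),
    "network": CategoryThresholds(85.0, 90.0),
    "storage": CategoryThresholds(90.0, 95.0),
    "security": CategoryThresholds(95.0, None),
}


def outcome_for(category: str, reliability: float) -> str:
    """Classify a reliability score against the thresholds for its category."""
    thresholds = CATEGORY_THRESHOLDS[category]
    if reliability < thresholds.min_reliability:
        return "manual_review"
    if thresholds.auto_approve is not None and reliability >= thresholds.auto_approve:
        return "auto_approve"
    return "needs_approval"


for category, score in [("cache", 86.0), ("network", 87.0), ("security", 99.0)]:
    print(category, score, outcome_for(category, score))
```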
+
+### 4. Human in the Loop
+
+- Always collect feedback, even for successful auto-remediations
+- Review logs weekly
+- Adjust thresholds based on feedback trends
+- Disable patterns with declining success rates
+
+### 5. Continuous Learning
+
+- System improves over time through feedback
+- Patterns with 20+ occurrences and 90%+ success → Very high confidence
+- Allow system to become more autonomous as reliability proves out
+- But maintain human oversight for critical operations
+
+---
+
+## Troubleshooting
+
+### Auto-remediation not executing
+
+**Check:**
+1. Is `enable_auto_remediation: true` in ticket?
+2. Is there an active policy for the category?
+3. Does confidence/reliability meet thresholds?
+4. Are safety checks passing?
+5. Does pattern meet eligibility requirements?
+
+**Debug:**
+```bash
+# Check decision
+curl http://localhost:8000/api/v1/tickets/TICKET-ID | jq '.remediation_decision'
+
+# Check logs
+curl http://localhost:8000/api/v1/tickets/TICKET-ID/remediation-logs
+```
+
+### Low reliability scores
+
+**Causes:**
+- Insufficient historical data
+- Negative feedback on category
+- Low pattern match confidence
+- Recent failures in category
+
+**Solutions:**
+- Collect more feedback
+- Review and improve resolutions
+- Wait for more data points
+- Manually resolve similar tickets successfully
+
+### Pattern not becoming eligible
+
+**Requirements not met:**
+- Need ≥5 occurrences
+- Need ≥85% positive feedback
+- Need ≥85% average reliability
+
+**Action:**
+- Continue resolving similar tickets
+- Ensure feedback is being collected
+- Check pattern stats: `GET /api/v1/patterns`
+
+---
+
+## Future Enhancements
+
+- **Multi-step reasoning**: Complex workflows spanning multiple systems
+- **Predictive remediation**: Fix issues before they cause incidents
+- **A/B testing**: Compare different resolution strategies
+- **Reinforcement learning**: Optimize actions based on outcomes
+- **Natural language explanations**: Better transparency in decisions
+- **Cross-system orchestration**: Coordinated actions across infrastructure
+
+---
+
+## Summary
+
+The **Auto-Remediation System** is designed for **safe, gradual automation** of infrastructure issue resolution:
+
+1. ✅ **Disabled by default** - explicit opt-in per ticket
+2. ✅ **Multi-factor reliability** - comprehensive confidence calculation
+3. ✅ **Human feedback loop** - continuous learning and improvement
+4. ✅ **Pattern recognition** - learns from similar issues
+5. ✅ **Safety first** - extensive checks, approval workflows, rollback
+6. ✅ **Progressive automation** - system becomes more autonomous over time
+7. ✅ **Full observability** - complete audit trail and analytics
+
+**Start small, monitor closely, scale gradually, and let the system learn.**
+
+---
+
+For support: automation-team@company.local