# 🤖 Auto-Remediation System - Complete Documentation

## 📋 Table of Contents

1. [Overview](#overview)
2. [Safety First Design](#safety-first-design)
3. [Reliability Scoring System](#reliability-scoring-system)
4. [Human Feedback Loop](#human-feedback-loop)
5. [Decision Engine](#decision-engine)
6. [Auto-Remediation Execution](#auto-remediation-execution)
7. [Pattern Learning](#pattern-learning)
8. [API Usage](#api-usage)
9. [Configuration](#configuration)
10. [Monitoring & Analytics](#monitoring--analytics)
11. [Best Practices](#best-practices)
12. [Troubleshooting](#troubleshooting)
13. [Future Enhancements](#future-enhancements)
14. [Summary](#summary)

---

## Overview

The **Auto-Remediation System** enables AI to autonomously resolve infrastructure issues by executing write operations on your systems. This is a **production-grade** implementation with extensive safety checks, human oversight, and continuous learning.

### Key Features

- ✅ **Safety-First**: Auto-remediation **disabled by default**
- ✅ **Reliability Scoring**: Multi-factor confidence calculation (0-100%)
- ✅ **Human Feedback**: Continuous learning from user feedback
- ✅ **Pattern Recognition**: Learns from similar issues
- ✅ **Approval Workflow**: Critical actions require human approval
- ✅ **Full Audit Trail**: Every action logged with rollback capability
- ✅ **Progressive Automation**: Decisions improve over time based on success rate

---

## Safety First Design

### 🛡️ Default State: DISABLED

```python
# Example: ticket submission payload
ticket = {
    "ticket_id": "INC-001",
    "description": "Problem description",
    "enable_auto_remediation": False,  # ← DEFAULT: Disabled
}
```

**Auto-remediation must be explicitly enabled for each ticket.**

### Safety Layers

1. **Explicit Enablement**: Must opt-in per ticket
2. **Reliability Thresholds**: Minimum confidence required
3. **Action Classification**: Safe vs. Critical operations
4. **Pre-execution Checks**: System health, backups, rate limits
5. **Human Approval**: Required for low-reliability or critical actions
6. **Post-execution Validation**: Verify success
7. **Rollback Capability**: Undo on failure

### Action Classification

```python
import enum


class RemediationAction(str, enum.Enum):
    READ_ONLY = "read_only"            # No changes (default)
    SAFE_WRITE = "safe_write"          # Non-destructive (restart, clear cache)
    CRITICAL_WRITE = "critical_write"  # Potentially destructive (delete, modify)
```

**Critical actions ALWAYS require human approval**, regardless of confidence.

---

## Reliability Scoring System

### Multi-Factor Calculation

The reliability score (0-100%) is calculated from **4 components**:

```python
# Each component is expressed on a 0-100 scale
reliability_score = (
    ai_confidence * 0.25      # Model's own confidence
    + human_feedback * 0.30   # Historical feedback quality
    + success_history * 0.25  # Past resolution success rate
    + pattern_match * 0.20    # Similarity to known patterns
)
```

### Component Details

#### 1. AI Confidence (25%)

- Direct from Claude Sonnet 4.5
- Based on documentation quality and analysis certainty
- Range: 0-1, converted to 0-100%

#### 2. Human Feedback (30%)

- Weighted by recency (recent feedback = more weight)
- Considers:
  - Positive/Negative/Neutral feedback type
  - Star ratings (1-5)
  - Resolution accuracy
  - Action effectiveness

```python
feedback_score = (
    positive_feedback_rate * 100  # rate in the range 0-1
    + average_rating / 5 * 100    # star rating in the range 1-5
) / 2
```

#### 3. Historical Success (25%)

- Success rate in same category (last 6 months)
- Formula: `resolved_tickets / total_tickets × 100`

#### 4. Pattern Match (20%)

- Similarity to known, resolved patterns
- Requires ≥3 similar tickets for pattern
- Boosts score if pattern has positive feedback

### Confidence Levels

| Score Range | Level     | Description |
|-------------|-----------|-------------|
| 90-100%     | Very High | Excellent track record, safe to auto-execute |
| 75-89%      | High      | Good reliability, may require approval |
| 60-74%      | Medium    | Moderate confidence, approval recommended |
| 0-59%       | Low       | Low confidence, manual review required |

### Example Breakdown

```json
{
  "overall_score": 87.5,
  "confidence_level": "high",
  "breakdown": {
    "ai_confidence": "92%",
    "human_validation": "85%",
    "success_history": "90%",
    "pattern_recognition": "82%"
  }
}
```

---

## Human Feedback Loop

### Feedback Collection

After each ticket resolution, collect structured feedback:

```python
feedback = {
    "ticket_id": "INC-001",
    "feedback_type": "positive",  # one of: positive, negative, neutral
    "rating": 5,  # 1-5 stars
    "was_helpful": True,
    "resolution_accurate": True,
    "actions_worked": True,

    # Optional detailed feedback
    "comment": "Great resolution!",
    "what_worked": "The restart fixed it",
    "what_didnt_work": None,
    "suggestions": "Could add more details",

    # If AI failed, what actually worked?
    "actual_resolution": "Had to increase memory instead",
    "actual_actions_taken": [...],
    "time_to_resolve": 30.0,  # minutes
}
```

### Feedback Impact

1. **Immediate**: Updates ticket reliability score
2. **Pattern Learning**: Strengthens/weakens pattern eligibility
3. **Future Decisions**: Influences similar ticket handling
4. **Auto-remediation Eligibility**: Pattern becomes eligible after:
   - ≥5 occurrences
   - ≥85% positive feedback rate
   - ≥85% average reliability score
   - ≥85% auto-remediation success rate (per the eligibility check under [Pattern Learning](#pattern-learning))

### Feedback Analytics

Track feedback trends:

- Positive/Negative/Neutral distribution
- Average ratings by category
- Resolution accuracy trends
- Action success rates

---

## Decision Engine

### Decision Flow

```
1. Check: Auto-remediation enabled for ticket?
   ├─ NO → Skip auto-remediation
   └─ YES → Continue

2. Get applicable policy for category
   ├─ No policy → Require manual approval
   └─ Policy exists → Continue

3. Classify action risk level
   ├─ READ_ONLY → Low risk
   ├─ SAFE_WRITE → Medium risk
   └─ CRITICAL_WRITE → High risk

4. Check confidence & reliability thresholds
   ├─ Below minimum → Reject
   └─ Above minimum → Continue

5. Perform safety checks
   ├─ Pre-checks failed → Reject
   └─ All passed → Continue

6. Check pattern eligibility
   ├─ Unknown pattern → Require approval
   └─ Known good pattern → Continue

7. Determine approval requirement
   ├─ Critical action → Require approval (always)
   ├─ Reliability ≥ auto_approve_threshold → Auto-approve
   └─ Otherwise → Follow policy

8. Execute or await approval
```

### Decision Example

```json
{
  "allowed": true,
  "action_type": "safe_write",
  "requires_approval": false,
  "reasoning": [
    "All checks passed",
    "Auto-approved: reliability 92% >= 90%"
  ],
  "safety_checks": {
    "time_window_ok": true,
    "rate_limit_ok": true,
    "backup_available": true,
    "system_healthy": true,
    "all_passed": true
  },
  "risk_level": "medium"
}
```

---

## Auto-Remediation Execution

### Execution Flow

```python
# Outline only: has_approval, check_system_health, execute_via_mcp,
# verify_success, rollback and log_remediation are defined elsewhere.
async def execute_remediation(ticket, actions, decision):
    # 1. Verify decision allows execution
    if not decision['allowed']:
        return "error"

    # 2. Check approval if required
    if decision['requires_approval'] and not has_approval(ticket):
        return "awaiting_approval"

    # 3. Execute each action with safety
    for action in actions:
        # Pre-execution check
        pre_check = await check_system_health()
        if not pre_check.passed:
            rollback()
            return "error"

        # Execute action via MCP
        result = await execute_via_mcp(action)

        # Post-execution verification
        post_check = await verify_success()
        if not post_check.passed:
            rollback()
            return "error"

        # Log action
        log_remediation(action, result)

    return "success"
```

### Supported Operations

#### VMware

- `restart_vm` - Graceful VM restart
- `snapshot_vm` - Create snapshot
- `increase_memory` - Increase VM memory
- `increase_cpu` - Add vCPUs

#### Kubernetes

- `restart_pod` - Delete pod (recreate)
- `scale_deployment` - Change replica count
- `rollback_deployment` - Rollback to previous version

#### Network

- `clear_interface_errors` - Clear interface counters
- `enable_port` - Enable disabled port
- `restart_interface` - Bounce interface

#### Storage

- `expand_volume` - Increase volume size
- `clear_snapshots` - Remove old snapshots

#### OpenStack

- `reboot_instance` - Soft reboot instance
- `resize_instance` - Change instance flavor

### Safety Checks

**Pre-execution:**

- System health check (CPU, memory, disk)
- Backup availability verification
- Rate limit check (max 10/hour)
- Time window check (maintenance hours)

**Post-execution:**

- Resource health verification
- Service availability check
- Performance metrics validation

### Rollback

If any action fails:

1. Stop execution immediately
2. Log failure details
3. Execute rollback procedures
4. Notify administrators
5. Update ticket status to `partially_remediated`

---

## Pattern Learning

### Pattern Identification

```python
# Generate pattern signature
# `signature` is the pattern's signature string (its construction is not shown here)
pattern = {
    'category': 'network',
    'key_terms': ['vlan', 'connectivity', 'timeout'],
    'hash': sha256(signature)
}
```

### Pattern Statistics

Tracked for each pattern:

- **Occurrence count**: How many times seen
- **Success/failure counts**: Resolution outcomes
- **Feedback distribution**: Positive/negative/neutral
- **Average confidence**: Mean AI confidence
- **Average reliability**: Mean reliability score
- **Auto-remediation success rate**: % of successful auto-fixes

### Pattern Eligibility

Pattern becomes eligible for auto-remediation when:

```python
if (
    pattern.occurrence_count >= 5
    and pattern.positive_feedback_rate >= 0.85
    and pattern.avg_reliability_score >= 85.0
    and pattern.auto_remediation_success_rate >= 0.85
):
    pattern.eligible_for_auto_remediation = True
```

### Pattern Evolution

```
Initial State:
├─ occurrence_count: 1
├─ eligible_for_auto_remediation: false
└─ Manual resolution only

After 5+ occurrences with good feedback:
├─ occurrence_count: 7
├─ positive_feedback_rate: 0.85
├─ avg_reliability_score: 87.0
├─ eligible_for_auto_remediation: true
└─ Can trigger auto-remediation

After 20+ occurrences:
├─ occurrence_count: 24
├─ auto_remediation_success_rate: 0.92
├─ Very high confidence
└─ Auto-remediation without approval (critical actions still require approval)
```

---

## API Usage

### Create Ticket with Auto-Remediation

```bash
curl -X POST http://localhost:8000/api/v1/tickets \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "title": "Service down",
    "description": "Web service not responding on port 8080",
    "category": "server",
    "enable_auto_remediation": true
  }'
```

**Response:**

```json
{
  "ticket_id": "INC-12345",
  "status": "processing",
  "auto_remediation_enabled": true,
  "confidence_score": 0.0,
  "reliability_score": null
}
```

### Check Ticket Status

```bash
curl http://localhost:8000/api/v1/tickets/INC-12345
```

**Response:**

```json
{
  "ticket_id": "INC-12345",
  "status": "resolved",
  "resolution": "Service was restarted successfully...",
  "suggested_actions": [
    {"action": "Restart web service", "system": "prod-web-01"}
  ],
  "confidence_score": 0.92,
  "reliability_score": 87.5,
  "reliability_breakdown": {
    "overall_score": 87.5,
    "confidence_level": "high",
    "breakdown": {...}
  },
  "auto_remediation_enabled": true,
  "auto_remediation_executed": true,
  "remediation_decision": {
    "allowed": true,
    "requires_approval": false,
    "action_type": "safe_write"
  },
  "remediation_results": {
    "success": true,
    "executed_actions": [...]
  }
}
```

### Submit Feedback

```bash
curl -X POST http://localhost:8000/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "feedback_type": "positive",
    "rating": 5,
    "was_helpful": true,
    "resolution_accurate": true,
    "actions_worked": true,
    "comment": "Perfect resolution, service is back up!"
  }'
```

### Approve Remediation

For tickets requiring approval:

```bash
curl -X POST http://localhost:8000/api/v1/tickets/INC-12345/approve-remediation \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "approve": true,
    "approver": "john.doe@company.com",
    "comment": "Approved for execution"
  }'
```

### Get Analytics

```bash
# URLs are quoted so the shell does not interpret `?` and `&`

# Reliability statistics
curl "http://localhost:8000/api/v1/stats/reliability?days=30"

# Auto-remediation statistics
curl "http://localhost:8000/api/v1/stats/auto-remediation?days=30"

# Learned patterns
curl "http://localhost:8000/api/v1/patterns?category=network&min_occurrences=5"
```

---

## Configuration

### Auto-Remediation Policy

```python
policy = AutoRemediationPolicy(
    name="network-auto-remediation",
    category="network",

    # Thresholds
    min_confidence_score=0.85,       # 85% AI confidence required
    min_reliability_score=80.0,      # 80% reliability required
    min_similar_tickets=5,           # Need 5+ similar resolved tickets
    min_positive_feedback_rate=0.8,  # 80% positive feedback required

    # Allowed actions
    allowed_action_types=["safe_write"],
    allowed_systems=["network"],
    forbidden_commands=["delete", "format", "shutdown"],

    # Time restrictions
    allowed_hours_start=22,  # 10 PM
    allowed_hours_end=6,     # 6 AM
    allowed_days=["monday", "tuesday", "wednesday", "thursday", "friday"],

    # Approval
    requires_approval=True,
    auto_approve_threshold=90.0,  # Auto-approve if reliability ≥ 90%
    approvers=["admin@company.com"],

    # Safety
    max_actions_per_hour=10,
    requires_rollback_plan=True,
    requires_backup=True,

    # Status
    enabled=True
)
```

### Environment Variables

```bash
# Enable/disable auto-remediation globally
AUTO_REMEDIATION_ENABLED=true

# Global safety settings
AUTO_REMEDIATION_MAX_ACTIONS_PER_HOUR=10
AUTO_REMEDIATION_REQUIRE_APPROVAL=true
AUTO_REMEDIATION_MIN_RELIABILITY=85.0

# Pattern learning
PATTERN_MIN_OCCURRENCES=5
PATTERN_MIN_POSITIVE_RATE=0.85
```

---

## Monitoring & Analytics

### Key Metrics

```text
# Reliability metrics
- avg_reliability_score: Average across all tickets
- avg_confidence_score: Average AI confidence
- resolution_rate: % of tickets resolved

# Auto-remediation metrics
- execution_rate: % of enabled tickets that were auto-remediated
- success_rate: % of auto-remediation actions that succeeded
- approval_rate: % requiring human approval

# Feedback metrics
- positive_feedback_rate: % positive feedback
- negative_feedback_rate: % negative feedback
- avg_rating: Average star rating (1-5)

# Pattern metrics
- eligible_patterns: Number of patterns eligible for auto-remediation
- pattern_success_rate: Success rate across all patterns
```

### Grafana Dashboards

Example queries:

```promql
# Reliability score trend
avg(datacenter_docs_reliability_score) by (category)

# Auto-remediation success rate
rate(datacenter_docs_auto_remediation_success_total[1h])
  / rate(datacenter_docs_auto_remediation_attempts_total[1h])

# Feedback sentiment
sum(datacenter_docs_feedback_total) by (type)
```

### Alerts

```yaml
# Low reliability alert
- alert: LowReliabilityScore
  expr: avg(datacenter_docs_reliability_score) < 70
  for: 1h
  annotations:
    summary: "Reliability score below threshold"

# High failure rate (failures as a fraction of attempts)
- alert: HighAutoRemediationFailureRate
  expr: |
    rate(datacenter_docs_auto_remediation_failures_total[1h])
      / rate(datacenter_docs_auto_remediation_attempts_total[1h]) > 0.2
  for: 15m
  annotations:
    summary: "Auto-remediation failure rate > 20%"
```

---

## Best Practices

### 1. Start Conservative

- Enable auto-remediation for **low-risk categories** first (e.g., cache clearing)
- Set high thresholds initially (reliability ≥ 90%)
- Require approvals for first 20-30 occurrences
- Monitor closely and adjust based on results

### 2. Gradual Rollout

```
Week 1-2: Enable for 5% of tickets
Week 3-4: Increase to 20% if success rate > 90%
Week 5-6: Increase to 50% if success rate > 85%
Week 7+:  Full rollout with dynamic thresholds
```

### 3. Category-Specific Policies

Different categories need different thresholds:

| Category | Min Reliability | Auto-Approve | Reason |
|----------|-----------------|--------------|--------|
| Cache    | 75%             | 85%          | Low risk, frequent |
| Network  | 85%             | 90%          | Medium risk |
| Storage  | 90%             | 95%          | High risk |
| Security | 95%             | Never        | Critical, always approve |

### 4. Human in the Loop

- Always collect feedback, even for successful auto-remediations
- Review logs weekly
- Adjust thresholds based on feedback trends
- Disable patterns with declining success rates

### 5. Continuous Learning

- System improves over time through feedback
- Patterns with 20+ occurrences and 90%+ success → Very high confidence
- Allow system to become more autonomous as reliability proves out
- But maintain human oversight for critical operations

---

## Troubleshooting

### Auto-remediation not executing

**Check:**

1. Is `enable_auto_remediation: true` in ticket?
2. Is there an active policy for the category?
3. Does confidence/reliability meet thresholds?
4. Are safety checks passing?
5. Does pattern meet eligibility requirements?

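
The five checks above can be walked through programmatically. The following is a minimal sketch, not part of the system's API: the `diagnose` helper is hypothetical, the ticket, policy, and pattern are assumed to be plain dictionaries using the field names shown elsewhere in this document, and it assumes the ticket's `remediation_decision` carries the `safety_checks` object from the Decision Example.

```python
from typing import Any, Dict, List, Optional


def diagnose(
    ticket: Dict[str, Any],
    policy: Optional[Dict[str, Any]],
    pattern: Optional[Dict[str, Any]],
) -> List[str]:
    """Hypothetical helper: list the checklist items that would block execution."""
    problems: List[str] = []

    # 1. Auto-remediation must be enabled on the ticket itself.
    if not ticket.get("auto_remediation_enabled", False):
        problems.append("auto-remediation not enabled on ticket")

    # 2. An active policy must exist for the ticket's category.
    if policy is None or not policy.get("enabled", False):
        problems.append("no active policy for category")
    else:
        # 3. Confidence (0-1) and reliability (0-100) must meet policy minimums.
        if (ticket.get("confidence_score") or 0.0) < policy["min_confidence_score"]:
            problems.append("confidence below policy minimum")
        if (ticket.get("reliability_score") or 0.0) < policy["min_reliability_score"]:
            problems.append("reliability below policy minimum")

    # 4. Every safety check recorded on the decision must have passed.
    decision = ticket.get("remediation_decision") or {}
    checks = decision.get("safety_checks") or {}
    if not checks.get("all_passed", False):
        failed = sorted(k for k, ok in checks.items() if k != "all_passed" and not ok)
        problems.append("safety checks failing: " + (", ".join(failed) or "no results"))

    # 5. The matched pattern must be eligible for auto-remediation.
    if pattern is None or not pattern.get("eligible_for_auto_remediation", False):
        problems.append("pattern not eligible")

    return problems
```

An empty list means none of the five checks is blocking. Collecting every failing item, rather than stopping at the first, is a design choice made here so that a single call shows everything that needs attention.
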

**Debug:**

```bash
# Check decision
curl http://localhost:8000/api/v1/tickets/TICKET-ID | jq '.remediation_decision'

# Check logs
curl http://localhost:8000/api/v1/tickets/TICKET-ID/remediation-logs
```

### Low reliability scores

**Causes:**

- Insufficient historical data
- Negative feedback on category
- Low pattern match confidence
- Recent failures in category

**Solutions:**

- Collect more feedback
- Review and improve resolutions
- Wait for more data points
- Manually resolve similar tickets successfully

### Pattern not becoming eligible

**Requirements not met:**

- Need ≥5 occurrences
- Need ≥85% positive feedback
- Need ≥85% average reliability
- Need ≥85% auto-remediation success rate (per the eligibility check under [Pattern Learning](#pattern-learning))

**Action:**

- Continue resolving similar tickets
- Ensure feedback is being collected
- Check pattern stats: `GET /api/v1/patterns`

---

## Future Enhancements

- **Multi-step reasoning**: Complex workflows spanning multiple systems
- **Predictive remediation**: Fix issues before they cause incidents
- **A/B testing**: Compare different resolution strategies
- **Reinforcement learning**: Optimize actions based on outcomes
- **Natural language explanations**: Better transparency in decisions
- **Cross-system orchestration**: Coordinated actions across infrastructure

---

## Summary

The **Auto-Remediation System** is designed for **safe, gradual automation** of infrastructure issue resolution:

1. ✅ **Disabled by default** - explicit opt-in per ticket
2. ✅ **Multi-factor reliability** - comprehensive confidence calculation
3. ✅ **Human feedback loop** - continuous learning and improvement
4. ✅ **Pattern recognition** - learns from similar issues
5. ✅ **Safety first** - extensive checks, approval workflows, rollback
6. ✅ **Progressive automation** - system becomes more autonomous over time
7. ✅ **Full observability** - complete audit trail and analytics

**Start small, monitor closely, scale gradually, and let the system learn.**

---

For support: automation-team@company.local