LLM Automation System 1ba5ce851d Initial commit: LLM Automation Docs & Remediation Engine v2.0
Features:
- Automated datacenter documentation generation
- MCP integration for device connectivity
- Auto-remediation engine with safety checks
- Multi-factor reliability scoring (0-100%)
- Human feedback learning loop
- Pattern recognition and continuous improvement
- Agentic chat support with AI
- API for ticket resolution
- Frontend React with Material-UI
- CI/CD pipelines (GitLab + Gitea)
- Docker & Kubernetes deployment
- Complete documentation and guides

v2.0 Highlights:
- Auto-remediation with write operations (disabled by default)
- Reliability calculator with 4-factor scoring
- Human feedback system for continuous learning
- Pattern-based progressive automation
- Approval workflow for critical actions
- Full audit trail and rollback capability
2025-10-17 23:47:28 +00:00


🤖 Auto-Remediation System - Complete Documentation

📋 Table of Contents

  1. Overview
  2. Safety First Design
  3. Reliability Scoring System
  4. Human Feedback Loop
  5. Decision Engine
  6. Auto-Remediation Execution
  7. Pattern Learning
  8. API Usage
  9. Configuration
  10. Monitoring & Analytics
  11. Best Practices
  12. Troubleshooting
  13. Future Enhancements
  14. Summary

Overview

The Auto-Remediation System enables AI to autonomously resolve infrastructure issues by executing write operations on your systems. This is a production-grade implementation with extensive safety checks, human oversight, and continuous learning.

Key Features

Safety-First: Auto-remediation disabled by default
Reliability Scoring: Multi-factor confidence calculation (0-100%)
Human Feedback: Continuous learning from user feedback
Pattern Recognition: Learns from similar issues
Approval Workflow: Critical actions require human approval
Full Audit Trail: Every action logged with rollback capability
Progressive Automation: Decisions improve over time based on success rate


Safety First Design

🛡️ Default State: DISABLED

Example ticket submission (enable_auto_remediation defaults to false):

{
    "ticket_id": "INC-001",
    "description": "Problem description",
    "enable_auto_remediation": false
}

Auto-remediation must be explicitly enabled for each ticket.

Safety Layers

  1. Explicit Enablement: Must opt-in per ticket
  2. Reliability Thresholds: Minimum confidence required
  3. Action Classification: Safe vs. Critical operations
  4. Pre-execution Checks: System health, backups, rate limits
  5. Human Approval: Required for low-reliability or critical actions
  6. Post-execution Validation: Verify success
  7. Rollback Capability: Undo on failure

Action Classification

import enum

class RemediationAction(str, enum.Enum):
    READ_ONLY = "read_only"           # No changes (default)
    SAFE_WRITE = "safe_write"          # Non-destructive (restart, clear cache)
    CRITICAL_WRITE = "critical_write"  # Potentially destructive (delete, modify)

Critical actions ALWAYS require human approval, regardless of confidence.


Reliability Scoring System

Multi-Factor Calculation

The reliability score (0-100%) is calculated from 4 components:

Reliability Score = (
    AI Confidence    × 25% +  # Model's own confidence
    Human Feedback   × 30% +  # Historical feedback quality
    Success History  × 25% +  # Past resolution success rate
    Pattern Match    × 20%    # Similarity to known patterns
)
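A minimal sketch of the weighted sum above, assuming each component has already been normalized to a 0-100 scale. The names `WEIGHTS` and `reliability_score` are illustrative and not taken from the engine's code.

```python
# Weights copied from the formula above; the names are hypothetical.
WEIGHTS = {
    "ai_confidence": 0.25,
    "human_feedback": 0.30,
    "success_history": 0.25,
    "pattern_match": 0.20,
}


def reliability_score(components: dict) -> float:
    """Return the weighted reliability score on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    # Clamp to guard against out-of-range inputs, then round to one decimal.
    return round(max(0.0, min(100.0, score)), 1)


example = reliability_score({
    "ai_confidence": 90.0,
    "human_feedback": 80.0,
    "success_history": 70.0,
    "pattern_match": 60.0,
})
```

With these inputs the terms are 22.5 + 24.0 + 17.5 + 12.0, so `example` is 76.0. Because the weights sum to 1, a ticket scoring 100 on every component scores exactly 100 overall.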

Component Details

1. AI Confidence (25%)

  • Reported directly by the model (Claude Sonnet 4.5)
  • Based on documentation quality and analysis certainty
  • Range: 0-1, scaled to 0-100%

2. Human Feedback (30%)

  • Weighted by recency (recent feedback = more weight)
  • Considers:
    • Positive/Negative/Neutral feedback type
    • Star ratings (1-5)
    • Resolution accuracy
    • Action effectiveness
feedback_score = (
    positive_feedback_rate × 100 +
    average_rating / 5 × 100
) / 2
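
The formula above can be written as a small function. This is a sketch only: it assumes `positive_feedback_rate` is a fraction in 0-1 and `average_rating` is on the 1-5 star scale, and it leaves out the recency weighting, which the text does not specify.

```python
def feedback_score(positive_feedback_rate: float, average_rating: float) -> float:
    """Average the positive-feedback percentage and the rating percentage (0-100)."""
    if not 0.0 <= positive_feedback_rate <= 1.0:
        raise ValueError("positive_feedback_rate must be in [0, 1]")
    if not 1.0 <= average_rating <= 5.0:
        raise ValueError("average_rating must be in [1, 5]")
    return (positive_feedback_rate * 100 + average_rating / 5 * 100) / 2


# 80% positive feedback and a 4.5-star average: (80 + 90) / 2
score = feedback_score(0.8, 4.5)
```

Here `score` is 85.0. Note that a 1-star average still contributes 20 points to the rating half, so the lowest possible result is 10, not 0.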

3. Historical Success (25%)

  • Success rate in same category (last 6 months)
  • Formula: resolved_tickets / total_tickets × 100

4. Pattern Match (20%)

  • Similarity to known, resolved patterns
  • Requires ≥3 similar tickets to form a pattern
  • Boosts score if pattern has positive feedback

Confidence Levels

| Score Range | Level | Description |
|---|---|---|
| 90-100% | Very High | Excellent track record, safe to auto-execute |
| 75-89% | High | Good reliability, may require approval |
| 60-74% | Medium | Moderate confidence, approval recommended |
| 0-59% | Low | Low confidence, manual review required |
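
A sketch of how a score could be mapped to these levels. The label strings (other than "high", which appears in the example breakdown) and the treatment of fractional scores such as 89.5 as belonging to the lower band are assumptions.

```python
def confidence_level(score: float) -> str:
    """Map a 0-100 reliability score to a confidence level label."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "very_high"
    if score >= 75:
        return "high"
    if score >= 60:
        return "medium"
    return "low"
```

Checking the bands from the top down with `>=` keeps them contiguous, so no score falls between two levels.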

Example Breakdown

{
  "overall_score": 87.4,
  "confidence_level": "high",
  "breakdown": {
    "ai_confidence": "92%",
    "human_validation": "85%",
    "success_history": "90%",
    "pattern_recognition": "82%"
  }
}

Human Feedback Loop

Feedback Collection

After each ticket resolution, collect structured feedback:

{
  "ticket_id": "INC-001",
  "feedback_type": "positive|negative|neutral",
  "rating": 5,  # 1-5 stars
  "was_helpful": true,
  "resolution_accurate": true,
  "actions_worked": true,
  
  # Optional detailed feedback
  "comment": "Great resolution!",
  "what_worked": "The restart fixed it",
  "what_didnt_work": null,
  "suggestions": "Could add more details",
  
  # If AI failed, what actually worked?
  "actual_resolution": "Had to increase memory instead",
  "actual_actions_taken": [...],
  "time_to_resolve": 30.0  # minutes
}

Feedback Impact

  1. Immediate: Updates ticket reliability score
  2. Pattern Learning: Strengthens/weakens pattern eligibility
  3. Future Decisions: Influences similar ticket handling
  4. Auto-remediation Eligibility: Pattern becomes eligible after:
    • ≥5 occurrences
    • ≥85% positive feedback rate
    • ≥85% average reliability score
    • ≥85% auto-remediation success rate

Feedback Analytics

Track feedback trends:

  • Positive/Negative/Neutral distribution
  • Average ratings by category
  • Resolution accuracy trends
  • Action success rates

Decision Engine

Decision Flow

1. Check: Auto-remediation enabled for ticket?
   ├─ NO → Skip auto-remediation
   └─ YES → Continue

2. Get applicable policy for category
   ├─ No policy → Require manual approval
   └─ Policy exists → Continue

3. Classify action risk level
   ├─ READ_ONLY → Low risk
   ├─ SAFE_WRITE → Medium risk
   └─ CRITICAL_WRITE → High risk

4. Check confidence & reliability thresholds
   ├─ Below minimum → Reject
   └─ At or above minimum → Continue

5. Perform safety checks
   ├─ Pre-checks failed → Reject
   └─ All passed → Continue

6. Check pattern eligibility
   ├─ Unknown pattern → Require approval
   └─ Known good pattern → Continue

7. Determine approval requirement
   ├─ Critical action → Require approval (always, regardless of score)
   ├─ Reliability ≥ auto_approve_threshold → Auto-approve
   └─ Otherwise → Follow policy

8. Execute or await approval
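
The flow above can be sketched as a single function. This is an illustration under stated assumptions, not the engine's implementation: `Policy`, `Decision`, and `decide` are hypothetical names, the policy fields are a subset of those shown under Configuration, and "no policy" is interpreted as allowed but pending manual approval. The critical-action check runs before the auto-approve check because critical actions always require human approval.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Policy:
    """Hypothetical subset of the policy fields shown under Configuration."""
    min_confidence_score: float = 0.85   # AI confidence, 0-1
    min_reliability_score: float = 80.0  # reliability, 0-100
    requires_approval: bool = True
    auto_approve_threshold: float = 90.0


@dataclass
class Decision:
    allowed: bool
    requires_approval: bool
    reasoning: List[str] = field(default_factory=list)


def decide(enabled: bool, policy: Optional[Policy], action_type: str,
           confidence: float, reliability: float,
           safety_checks_passed: bool, pattern_eligible: bool) -> Decision:
    if not enabled:                                        # step 1
        return Decision(False, False, ["Auto-remediation not enabled"])
    if policy is None:                                     # step 2
        return Decision(True, True, ["No policy: manual approval required"])
    if (confidence < policy.min_confidence_score
            or reliability < policy.min_reliability_score):  # step 4
        return Decision(False, False, ["Below minimum thresholds"])
    if not safety_checks_passed:                           # step 5
        return Decision(False, False, ["Safety checks failed"])
    if action_type == "critical_write":                    # steps 3 and 7
        return Decision(True, True, ["Critical action: approval required"])
    if not pattern_eligible:                               # step 6
        return Decision(True, True, ["Unknown pattern: approval required"])
    if reliability >= policy.auto_approve_threshold:       # step 7
        return Decision(True, False, [
            f"Auto-approved: reliability {reliability:.0f}% >= "
            f"{policy.auto_approve_threshold:.0f}%"])
    return Decision(True, policy.requires_approval, ["Following policy default"])
```

Each early return records a reason, which mirrors the `reasoning` list in the decision example.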

Decision Example

{
  "allowed": true,
  "action_type": "safe_write",
  "requires_approval": false,
  "reasoning": [
    "All checks passed",
    "Auto-approved: reliability 92% >= 90%"
  ],
  "safety_checks": {
    "time_window_ok": true,
    "rate_limit_ok": true,
    "backup_available": true,
    "system_healthy": true,
    "all_passed": true
  },
  "risk_level": "medium"
}

Auto-Remediation Execution

Execution Flow

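async def execute_remediation(ticket, actions, decision):
    # Helper functions (has_approval, check_system_health, execute_via_mcp,
    # verify_success, rollback, log_remediation) are defined elsewhere in the
    # engine. The 'error' key in the result dicts is illustrative.

    # 1. Verify decision allows execution
    if not decision['allowed']:
        return {'success': False, 'error': 'not_allowed', 'executed_actions': []}

    # 2. Check approval if required
    if decision['requires_approval'] and not has_approval(ticket):
        return {'success': False, 'error': 'awaiting_approval', 'executed_actions': []}

    # 3. Execute each action with safety
    executed = []
    for action in actions:
        # Pre-execution check
        pre_check = await check_system_health()
        if not pre_check.passed:
            rollback()
            return {'success': False, 'error': 'pre_check_failed',
                    'executed_actions': executed}

        # Execute action via MCP
        result = await execute_via_mcp(action)

        # Log before verifying so that failed actions also reach the audit trail
        log_remediation(action, result)
        executed.append(action)

        # Post-execution verification
        post_check = await verify_success()
        if not post_check.passed:
            rollback()
            return {'success': False, 'error': 'post_check_failed',
                    'executed_actions': executed}

    return {'success': True, 'executed_actions': executed}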

Supported Operations

VMware

  • restart_vm - Graceful VM restart
  • snapshot_vm - Create snapshot
  • increase_memory - Increase VM memory
  • increase_cpu - Add vCPUs

Kubernetes

  • restart_pod - Delete pod (recreate)
  • scale_deployment - Change replica count
  • rollback_deployment - Rollback to previous version

Network

  • clear_interface_errors - Clear interface counters
  • enable_port - Enable disabled port
  • restart_interface - Bounce interface

Storage

  • expand_volume - Increase volume size
  • clear_snapshots - Remove old snapshots

OpenStack

  • reboot_instance - Soft reboot instance
  • resize_instance - Change instance flavor

Safety Checks

Pre-execution:

  • System health check (CPU, memory, disk)
  • Backup availability verification
  • Rate limit check (max 10/hour)
  • Time window check (maintenance hours)

Post-execution:

  • Resource health verification
  • Service availability check
  • Performance metrics validation
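
One way to implement the rate limit check (max 10/hour) is a sliding-window counter. The sketch below is an assumption about how such a check might work; the class name and its in-memory state are hypothetical, and a multi-process deployment would need shared storage instead.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowRateLimiter:
    """Allow at most max_actions within any window of window_seconds."""

    def __init__(self, max_actions: int = 10, window_seconds: float = 3600.0):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of recently allowed actions

    def allow(self, now: Optional[float] = None) -> bool:
        """Record and allow the action if the window still has capacity."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True
```

Passing `now` explicitly makes the limiter easy to test. In production the default `time.monotonic()` avoids problems when the wall clock is adjusted.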

Rollback

If any action fails:

  1. Stop execution immediately
  2. Log failure details
  3. Execute rollback procedures
  4. Notify administrators
  5. Update ticket status to partially_remediated

Pattern Learning

Pattern Identification

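import hashlib

# Generate pattern signature (format is illustrative: category + sorted key terms)
category = 'network'
key_terms = ['vlan', 'connectivity', 'timeout']
signature = f"{category}:{','.join(sorted(key_terms))}"
pattern = {
    'category': category,
    'key_terms': key_terms,
    'hash': hashlib.sha256(signature.encode('utf-8')).hexdigest()
}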

Pattern Statistics

Tracked for each pattern:

  • Occurrence count: How many times seen
  • Success/failure counts: Resolution outcomes
  • Feedback distribution: Positive/negative/neutral
  • Average confidence: Mean AI confidence
  • Average reliability: Mean reliability score
  • Auto-remediation success rate: % of successful auto-fixes

Pattern Eligibility

Pattern becomes eligible for auto-remediation when:

if (
    pattern.occurrence_count >= 5 and
    pattern.positive_feedback_rate >= 0.85 and
    pattern.avg_reliability_score >= 85.0 and
    pattern.auto_remediation_success_rate >= 0.85
):
    pattern.eligible_for_auto_remediation = True

Pattern Evolution

Initial State:
├─ occurrence_count: 1
├─ eligible_for_auto_remediation: false
└─ Manual resolution only

After 5+ occurrences with good feedback:
├─ occurrence_count: 7
├─ positive_feedback_rate: 0.85
├─ avg_reliability_score: 87.0
├─ eligible_for_auto_remediation: true
└─ Can trigger auto-remediation

After 20+ occurrences:
├─ occurrence_count: 24
├─ auto_remediation_success_rate: 0.92
├─ Very high confidence
└─ Auto-remediation without approval (non-critical actions only)

API Usage

Create Ticket with Auto-Remediation

curl -X POST http://localhost:8000/api/v1/tickets \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "title": "Service down",
    "description": "Web service not responding on port 8080",
    "category": "server",
    "enable_auto_remediation": true
  }'

Response:

{
  "ticket_id": "INC-12345",
  "status": "processing",
  "auto_remediation_enabled": true,
  "confidence_score": 0.0,
  "reliability_score": null
}
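
The same request can be built from Python with only the standard library. This is a sketch that reuses the URL and fields from the curl example; the function names are hypothetical, and `submit` is not called here because it needs a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # same host as the curl examples


def build_ticket_request(ticket_id: str, title: str, description: str,
                         category: str,
                         enable_auto_remediation: bool = False) -> urllib.request.Request:
    """Build the POST request; auto-remediation stays off unless requested."""
    payload = {
        "ticket_id": ticket_id,
        "title": title,
        "description": description,
        "category": category,
        "enable_auto_remediation": enable_auto_remediation,
    }
    return urllib.request.Request(
        f"{BASE_URL}/tickets",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


request = build_ticket_request(
    "INC-12345", "Service down",
    "Web service not responding on port 8080", "server",
    enable_auto_remediation=True,
)
```

Defaulting `enable_auto_remediation` to `False` in the helper matches the opt-in behavior described under Safety First Design.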

Check Ticket Status

curl http://localhost:8000/api/v1/tickets/INC-12345

Response:

{
  "ticket_id": "INC-12345",
  "status": "resolved",
  "resolution": "Service was restarted successfully...",
  "suggested_actions": [
    {"action": "Restart web service", "system": "prod-web-01"}
  ],
  "confidence_score": 0.92,
  "reliability_score": 87.5,
  "reliability_breakdown": {
    "overall_score": 87.5,
    "confidence_level": "high",
    "breakdown": {...}
  },
  "auto_remediation_enabled": true,
  "auto_remediation_executed": true,
  "remediation_decision": {
    "allowed": true,
    "requires_approval": false,
    "action_type": "safe_write"
  },
  "remediation_results": {
    "success": true,
    "executed_actions": [...]
  }
}

Submit Feedback

curl -X POST http://localhost:8000/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "feedback_type": "positive",
    "rating": 5,
    "was_helpful": true,
    "resolution_accurate": true,
    "actions_worked": true,
    "comment": "Perfect resolution, service is back up!"
  }'

Approve Remediation

For tickets requiring approval:

curl -X POST http://localhost:8000/api/v1/tickets/INC-12345/approve-remediation \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "approve": true,
    "approver": "john.doe@company.com",
    "comment": "Approved for execution"
  }'

Get Analytics

# Reliability statistics
curl "http://localhost:8000/api/v1/stats/reliability?days=30"

# Auto-remediation statistics
curl "http://localhost:8000/api/v1/stats/auto-remediation?days=30"

# Learned patterns (quote the URL so the shell does not interpret & and ?)
curl "http://localhost:8000/api/v1/patterns?category=network&min_occurrences=5"

Configuration

Auto-Remediation Policy

policy = AutoRemediationPolicy(
    name="network-auto-remediation",
    category="network",
    
    # Thresholds
    min_confidence_score=0.85,      # 85% AI confidence required
    min_reliability_score=80.0,     # 80% reliability required
    min_similar_tickets=5,          # Need 5+ similar resolved tickets
    min_positive_feedback_rate=0.8, # 80% positive feedback required
    
    # Allowed actions
    allowed_action_types=["safe_write"],
    allowed_systems=["network"],
    forbidden_commands=["delete", "format", "shutdown"],
    
    # Time restrictions
    allowed_hours_start=22,  # 10 PM
    allowed_hours_end=6,     # 6 AM
    allowed_days=["monday", "tuesday", "wednesday", "thursday", "friday"],
    
    # Approval
    requires_approval=True,
    auto_approve_threshold=90.0,  # Auto-approve if reliability ≥ 90%
    approvers=["admin@company.com"],
    
    # Safety
    max_actions_per_hour=10,
    requires_rollback_plan=True,
    requires_backup=True,
    
    # Status
    enabled=True
)
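
The time restriction above (22 to 6) crosses midnight, which a naive `start <= hour < end` comparison would reject for every hour. A sketch of a check that handles the wrap-around follows; the function name is hypothetical, and it assumes the day restriction applies to the calendar day of the current time, which the policy does not specify.

```python
from datetime import datetime
from typing import Sequence

DAYS = ("monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday")


def in_allowed_window(now: datetime, start_hour: int, end_hour: int,
                      allowed_days: Sequence[str]) -> bool:
    """Return True if `now` falls on an allowed day and inside the hour window."""
    if DAYS[now.weekday()] not in allowed_days:
        return False
    if start_hour <= end_hour:
        # Same-day window, e.g. 9 to 17.
        return start_hour <= now.hour < end_hour
    # Window wraps past midnight, e.g. 22 to 6.
    return now.hour >= start_hour or now.hour < end_hour


WEEKDAYS = DAYS[:5]
friday_night = in_allowed_window(datetime(2025, 10, 17, 23, 0), 22, 6, WEEKDAYS)
friday_noon = in_allowed_window(datetime(2025, 10, 17, 12, 0), 22, 6, WEEKDAYS)
```

Here `friday_night` is `True` and `friday_noon` is `False`. Under the calendar-day assumption, 02:00 on Saturday is rejected even though it continues Friday's window; checking the day on which the window opened would be the alternative.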

Environment Variables

# Enable/disable auto-remediation globally
AUTO_REMEDIATION_ENABLED=true

# Global safety settings
AUTO_REMEDIATION_MAX_ACTIONS_PER_HOUR=10
AUTO_REMEDIATION_REQUIRE_APPROVAL=true
AUTO_REMEDIATION_MIN_RELIABILITY=85.0

# Pattern learning
PATTERN_MIN_OCCURRENCES=5
PATTERN_MIN_POSITIVE_RATE=0.85

Monitoring & Analytics

Key Metrics

# Reliability metrics
- avg_reliability_score: Average across all tickets
- avg_confidence_score: Average AI confidence
- resolution_rate: % of tickets resolved

# Auto-remediation metrics
- execution_rate: % of enabled tickets that were auto-remediated
- success_rate: % of auto-remediation actions that succeeded
- approval_rate: % requiring human approval

# Feedback metrics
- positive_feedback_rate: % positive feedback
- negative_feedback_rate: % negative feedback
- avg_rating: Average star rating (1-5)

# Pattern metrics
- eligible_patterns: # of patterns eligible for auto-remediation
- pattern_success_rate: Success rate across all patterns

Grafana Dashboards

Example PromQL queries:

# Reliability score trend
avg(datacenter_docs_reliability_score) by (category)

# Auto-remediation success rate
rate(datacenter_docs_auto_remediation_success_total[1h]) /
rate(datacenter_docs_auto_remediation_attempts_total[1h])

# Feedback sentiment
sum(datacenter_docs_feedback_total) by (type)

Alerts

# Low reliability alert
- alert: LowReliabilityScore
  expr: avg(datacenter_docs_reliability_score) < 70
  for: 1h
  annotations:
    summary: "Reliability score below threshold"

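# High failure rate (failures divided by attempts; rate() alone is per-second)
- alert: HighAutoRemediationFailureRate
  expr: |
    rate(datacenter_docs_auto_remediation_failures_total[1h]) /
    rate(datacenter_docs_auto_remediation_attempts_total[1h]) > 0.2
  for: 15m
  annotations:
    summary: "Auto-remediation failure rate > 20%"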

Best Practices

1. Start Conservative

  • Enable auto-remediation for low-risk categories first (e.g., cache clearing)
  • Set high thresholds initially (reliability ≥ 90%)
  • Require approvals for first 20-30 occurrences
  • Monitor closely and adjust based on results

2. Gradual Rollout

Week 1-2: Enable for 5% of tickets
Week 3-4: Increase to 20% if success rate > 90%
Week 5-6: Increase to 50% if success rate > 85%
Week 7+:  Full rollout with dynamic thresholds
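
A percentage rollout needs a stable way to decide which tickets are included. One common approach, shown here as a sketch and not as the system's actual mechanism, is to hash the ticket ID into a fixed bucket. The function name is hypothetical.

```python
import hashlib


def in_rollout(ticket_id: str, percent: float) -> bool:
    """Return True if this ticket falls inside the current rollout percentage."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a stable bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Because the bucket depends only on the ticket ID, the same ticket always gets the same answer, and raising the percentage from 5 to 20 keeps every previously included ticket in the rollout.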

3. Category-Specific Policies

Different categories need different thresholds:

| Category | Min Reliability | Auto-Approve | Reason |
|---|---|---|---|
| Cache | 75% | 85% | Low risk, frequent |
| Network | 85% | 90% | Medium risk |
| Storage | 90% | 95% | High risk |
| Security | 95% | Never | Critical, always approve |
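
The table above can be expressed as a lookup. This sketch uses hypothetical names and return labels, and it assumes that a category without an entry falls back to manual approval, in line with the "no policy" branch of the decision flow.

```python
from typing import Dict, Optional, Tuple

# (min reliability, auto-approve threshold); None means never auto-approve.
CATEGORY_THRESHOLDS: Dict[str, Tuple[float, Optional[float]]] = {
    "cache": (75.0, 85.0),
    "network": (85.0, 90.0),
    "storage": (90.0, 95.0),
    "security": (95.0, None),
}


def approval_mode(category: str, reliability: float) -> str:
    """Return 'rejected', 'needs_approval', or 'auto_approved' for a ticket."""
    thresholds = CATEGORY_THRESHOLDS.get(category.lower())
    if thresholds is None:
        return "needs_approval"  # no policy: fall back to manual approval
    min_reliability, auto_approve = thresholds
    if reliability < min_reliability:
        return "rejected"
    if auto_approve is not None and reliability >= auto_approve:
        return "auto_approved"
    return "needs_approval"
```

For example, a cache ticket at 86% is auto-approved, while a security ticket at 99% still needs approval.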

4. Human in the Loop

  • Always collect feedback, even for successful auto-remediations
  • Review logs weekly
  • Adjust thresholds based on feedback trends
  • Disable patterns with declining success rates

5. Continuous Learning

  • System improves over time through feedback
  • Patterns with 20+ occurrences and 90%+ success → Very high confidence
  • Allow system to become more autonomous as reliability proves out
  • But maintain human oversight for critical operations

Troubleshooting

Auto-remediation not executing

Check:

  1. Is enable_auto_remediation: true in ticket?
  2. Is there an active policy for the category?
  3. Does confidence/reliability meet thresholds?
  4. Are safety checks passing?
  5. Does pattern meet eligibility requirements?

Debug:

# Check decision
curl http://localhost:8000/api/v1/tickets/TICKET-ID | jq '.remediation_decision'

# Check logs
curl http://localhost:8000/api/v1/tickets/TICKET-ID/remediation-logs

Low reliability scores

Causes:

  • Insufficient historical data
  • Negative feedback on category
  • Low pattern match confidence
  • Recent failures in category

Solutions:

  • Collect more feedback
  • Review and improve resolutions
  • Wait for more data points
  • Manually resolve similar tickets successfully

Pattern not becoming eligible

Requirements not met:

  • Need ≥5 occurrences
  • Need ≥85% positive feedback
  • Need ≥85% average reliability
  • Need ≥85% auto-remediation success rate

Action:

  • Continue resolving similar tickets
  • Ensure feedback is being collected
  • Check pattern stats: GET /api/v1/patterns

Future Enhancements

  • Multi-step reasoning: Complex workflows spanning multiple systems
  • Predictive remediation: Fix issues before they cause incidents
  • A/B testing: Compare different resolution strategies
  • Reinforcement learning: Optimize actions based on outcomes
  • Natural language explanations: Better transparency in decisions
  • Cross-system orchestration: Coordinated actions across infrastructure

Summary

The Auto-Remediation System is designed for safe, gradual automation of infrastructure issue resolution:

  1. Disabled by default - explicit opt-in per ticket
  2. Multi-factor reliability - comprehensive confidence calculation
  3. Human feedback loop - continuous learning and improvement
  4. Pattern recognition - learns from similar issues
  5. Safety first - extensive checks, approval workflows, rollback
  6. Progressive automation - system becomes more autonomous over time
  7. Full observability - complete audit trail and analytics

Start small, monitor closely, scale gradually, and let the system learn.


For support: automation-team@company.local