LLM Automation System 1ba5ce851d Initial commit: LLM Automation Docs & Remediation Engine v2.0

Features:
- Automated datacenter documentation generation
- MCP integration for device connectivity
- Auto-remediation engine with safety checks
- Multi-factor reliability scoring (0-100%)
- Human feedback learning loop
- Pattern recognition and continuous improvement
- Agentic chat support with AI
- API for ticket resolution
- Frontend React with Material-UI
- CI/CD pipelines (GitLab + Gitea)
- Docker & Kubernetes deployment
- Complete documentation and guides

v2.0 Highlights:
- Auto-remediation with write operations (disabled by default)
- Reliability calculator with 4-factor scoring
- Human feedback system for continuous learning
- Pattern-based progressive automation
- Approval workflow for critical actions
- Full audit trail and rollback capability

2025-10-17 23:47:28 +00:00


🎉 What's New in v2.0 - Auto-Remediation & Feedback System

🚀 Major New Features

1️⃣ Auto-Remediation (Write Operations) ⚠️

AI can now automatically fix problems by executing write operations on your infrastructure.

Key Points:

✅ DEFAULT: DISABLED - Must explicitly enable per ticket for safety
✅ Smart Decision Engine - Only executes when confidence is high
✅ Safety Checks - Pre/post validation, backups, rollbacks
✅ Approval Workflow - Critical actions require human approval
✅ Full Audit Trail - Every action logged

Example Usage:

# Submit ticket WITH auto-remediation
{
    "ticket_id": "INC-001",
    "description": "Web service not responding",
    "category": "server",
    "enable_auto_remediation": true  # ← Enable write operations
}

# AI will:
# 1. Analyze the problem
# 2. Check reliability score
# 3. If score ≥85% and safe action → Execute automatically
# 4. If critical action → Request approval
# 5. Log all actions taken

What AI Can Do:

Restart services/VMs
Clear caches
Scale deployments
Enable network ports
Expand storage volumes
Rollback deployments

Safety Guardrails:

Minimum 85% reliability required
Rate limiting (max 10 actions/hour)
Time windows (maintenance hours only)
Backup verification
System health checks
Rollback on failure
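
As an illustration of the rate-limiting guardrail, here is a minimal sliding-window sketch in Python. The class name `ActionRateLimiter` and its method are hypothetical and not taken from the project; only the limit of 10 actions per hour comes from the list above.

```python
from collections import deque
from datetime import datetime, timedelta


class ActionRateLimiter:
    """Sliding-window limiter: at most `max_actions` within `window`.

    Hypothetical helper; the real engine may enforce the limit differently.
    """

    def __init__(self, max_actions=10, window=timedelta(hours=1)):
        self.max_actions = max_actions
        self.window = window
        self._timestamps = deque()  # execution times, oldest first

    def try_acquire(self, now):
        """Record an action at `now` and return True if the limit allows it."""
        # Drop executions that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True
```

A caller would check `limiter.try_acquire(datetime.now())` before each write operation and fall back to the approval workflow when it returns `False`.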

2️⃣ Reliability Scoring System 📊

Multi-factor confidence calculation that gets smarter over time.

How It Works:

Reliability Score (0-100%) = 
  AI Confidence        × 25% +  # Claude's confidence
  Human Feedback       × 30% +  # User ratings & feedback
  Historical Success   × 25% +  # Past resolution success rate
  Pattern Recognition  × 20%    # Similarity to known issues
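
The weighted sum can be sketched in a few lines of Python. This is not the project's reliability calculator, only a direct transcription of the weights above; the factor keys follow the `breakdown` field names used in the example response in this section, and the function name is an assumption.

```python
from typing import Mapping

# Weights from the formula above; they sum to 1.0.
WEIGHTS = {
    "ai_confidence": 0.25,
    "human_validation": 0.30,
    "success_history": 0.25,
    "pattern_recognition": 0.20,
}


def reliability_score(factors: Mapping[str, float]) -> float:
    """Return the weighted reliability score (0-100) for factor scores on a 0-100 scale."""
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= factors[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return round(sum(WEIGHTS[name] * factors[name] for name in WEIGHTS), 1)


score = reliability_score({
    "ai_confidence": 92,
    "human_validation": 85,
    "success_history": 90,
    "pattern_recognition": 82,
})
print(score)  # 23.0 + 25.5 + 22.5 + 16.4 = 87.4
```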

Confidence Levels:

Score	Level	Action
90-100%	🟢 Very High	Auto-execute without approval
75-89%	🔵 High	Auto-execute if ≥85% and a safe action; otherwise require approval
60-74%	🟡 Medium	Require approval
0-59%	🔴 Low	Manual resolution only
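
A small Python sketch of the score-to-level mapping follows. The table lists whole-number bands, so this version treats each band's lower bound as the cut-off (for example, 89.5 counts as High). The label `"very_high"` is an assumed spelling; only `"high"` appears in the example response.

```python
def confidence_level(score: float) -> str:
    """Map a reliability score (0-100) to the level names used in the table."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "very_high"  # assumed label
    if score >= 75:
        return "high"
    if score >= 60:
        return "medium"
    return "low"
```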

Example:

{
  "reliability_score": 87.4,
  "confidence_level": "high",
  "breakdown": {
    "ai_confidence": "92%",
    "human_validation": "85%", 
    "success_history": "90%",
    "pattern_recognition": "82%"
  }
}

3️⃣ Human Feedback Loop 🔄

Your feedback makes the AI smarter.

What You Can Provide:

{
  "ticket_id": "INC-001",
  "feedback_type": "positive|negative|neutral",
  "rating": 5,  // 1-5 stars
  "was_helpful": true,
  "resolution_accurate": true,
  "actions_worked": true,
  
  // Optional details
  "comment": "Perfect! Service is back up.",
  "what_worked": "The service restart fixed it",
  "what_didnt_work": null,
  "suggestions": "Could add health check step",
  
  // If AI failed, what actually worked?
  "actual_resolution": "Had to increase memory instead",
  "time_to_resolve": 30.0  // minutes
}

Impact of Feedback:

Immediate: Updates reliability score for that ticket
Pattern Learning: Strengthens/weakens similar issue handling
Future Decisions: Influences auto-remediation eligibility
System Improvement: Better resolutions over time
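
Before sending feedback, a client could validate it locally. The sketch below mirrors the fields of the payload shown above in a Python dataclass; the class name and the validation rules (rating 1-5, one of three feedback types, non-negative time) are assumptions drawn from the inline comments, not the project's actual schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional

VALID_FEEDBACK_TYPES = {"positive", "negative", "neutral"}


@dataclass
class TicketFeedback:
    """Hypothetical client-side model of the feedback payload."""

    ticket_id: str
    feedback_type: str
    rating: int  # 1-5 stars
    was_helpful: bool
    resolution_accurate: bool
    actions_worked: bool
    comment: Optional[str] = None
    what_worked: Optional[str] = None
    what_didnt_work: Optional[str] = None
    suggestions: Optional[str] = None
    actual_resolution: Optional[str] = None
    time_to_resolve: Optional[float] = None  # minutes

    def __post_init__(self) -> None:
        if self.feedback_type not in VALID_FEEDBACK_TYPES:
            raise ValueError(f"unknown feedback_type: {self.feedback_type!r}")
        if not 1 <= self.rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        if self.time_to_resolve is not None and self.time_to_resolve < 0:
            raise ValueError("time_to_resolve must be non-negative")


feedback = TicketFeedback(
    ticket_id="INC-001",
    feedback_type="positive",
    rating=5,
    was_helpful=True,
    resolution_accurate=True,
    actions_worked=True,
    comment="Perfect! Service is back up.",
)
payload = asdict(feedback)  # plain dict, ready to serialize as JSON
```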

4️⃣ Pattern Learning & Recognition 🧠

AI learns from repeated issues and gets better at handling them.

How Patterns Work:

Issue occurs first time:
└─ Manual resolution, collect feedback

After 5+ similar issues with good feedback:
├─ Pattern identified and eligible for auto-remediation
├─ Success rate: 85%+
└─ Can auto-fix similar issues in future

After 20+ occurrences:
├─ Very high confidence (90%+)
├─ Success rate: 92%+
└─ Auto-fix without approval (if safe action)

Pattern Eligibility Criteria:

eligible_for_auto_remediation = (
    occurrence_count >= 5 and
    positive_feedback_rate >= 0.85 and
    avg_reliability_score >= 85.0 and
    auto_remediation_success_rate >= 0.85
)
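
For debugging why a pattern is not yet eligible, it can help to report which criteria are unmet rather than a single boolean. The following Python sketch does that under the thresholds listed above; `PatternStats` and `unmet_criteria` are hypothetical names, and the sample values are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PatternStats:
    """Hypothetical summary of a learned pattern."""

    occurrence_count: int
    positive_feedback_rate: float         # 0.0-1.0
    avg_reliability_score: float          # 0-100
    auto_remediation_success_rate: float  # 0.0-1.0


def unmet_criteria(stats: PatternStats) -> List[str]:
    """Return the names of eligibility criteria the pattern fails (empty if eligible)."""
    checks = {
        "occurrence_count >= 5": stats.occurrence_count >= 5,
        "positive_feedback_rate >= 0.85": stats.positive_feedback_rate >= 0.85,
        "avg_reliability_score >= 85.0": stats.avg_reliability_score >= 85.0,
        "auto_remediation_success_rate >= 0.85": stats.auto_remediation_success_rate >= 0.85,
    }
    return [name for name, passed in checks.items() if not passed]


# A pattern seen only 3 times with a 78% average score fails two criteria.
new_pattern = PatternStats(3, 0.90, 78.0, 0.90)
print(unmet_criteria(new_pattern))
```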

📋 New Database Models

Tables Added:

ticket_feedbacks - Store human feedback
similar_tickets - Track pattern similarities
remediation_logs - Audit trail of actions
auto_remediation_policies - Configuration per category
remediation_approvals - Approval workflow
ticket_patterns - Learned patterns

🔧 New API Endpoints

Core Functionality

# Create ticket with auto-remediation
POST /api/v1/tickets
{
  "enable_auto_remediation": true  # New parameter
}

# Get enhanced ticket status
GET /api/v1/tickets/{ticket_id}
# Returns: reliability_score, remediation_decision, etc.

Feedback System

# Submit feedback
POST /api/v1/feedback

# Get ticket feedback history
GET /api/v1/tickets/{ticket_id}/feedback

Auto-Remediation Control

# Approve/reject remediation
POST /api/v1/tickets/{ticket_id}/approve-remediation

# Get remediation execution logs
GET /api/v1/tickets/{ticket_id}/remediation-logs

Analytics & Monitoring

# Reliability statistics
GET /api/v1/stats/reliability?days=30&category=network

# Auto-remediation statistics
GET /api/v1/stats/auto-remediation?days=30

# View learned patterns
GET /api/v1/patterns?category=network&min_occurrences=5
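
The endpoints above can be exercised from Python using only the standard library. The sketch below builds the requests without sending them; the host name is a placeholder, and authentication is omitted because this document does not describe it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://automation.example.internal"  # placeholder host, not from the docs


def create_ticket_request(ticket: dict) -> Request:
    """Build (but do not send) a POST /api/v1/tickets request with a JSON body."""
    body = json.dumps(ticket).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/v1/tickets",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stats_url(kind: str, **params) -> str:
    """Build a statistics URL, e.g. /api/v1/stats/reliability?days=30&category=network."""
    return f"{BASE_URL}/api/v1/stats/{kind}?{urlencode(params)}"


req = create_ticket_request({
    "ticket_id": "INC-001",
    "description": "Web service not responding",
    "category": "server",
    "enable_auto_remediation": True,
})
url = stats_url("reliability", days=30, category="network")
```

Passing `req` to `urllib.request.urlopen` would send the request; that step is left out so the example runs without a live server.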

🎨 Frontend Enhancements

New UI Components:

Auto-Remediation Toggle (with safety warning)
Reliability Score Display (with breakdown)
Feedback Form (star rating, comments, detailed feedback)
Remediation Logs Viewer (audit trail)
Analytics Dashboard (reliability trends, success rates)
Pattern Viewer (learned patterns and eligibility)

Visual Indicators:

🟢 Green: Very high reliability (90%+)
🔵 Blue: High reliability (75-89%)
🟡 Yellow: Medium reliability (60-74%)
🔴 Red: Low reliability (<60%)

📊 Example Workflow

Traditional Flow (v1.0)

1. User submits ticket
2. AI analyzes and suggests resolution
3. User manually executes actions
4. Done

Enhanced Flow (v2.0)

1. User submits ticket with enable_auto_remediation=true
2. AI analyzes problem
3. AI calculates reliability score
4. Decision Engine evaluates:
   ├─ High confidence + safe action → Execute automatically
   ├─ Medium confidence or critical action → Request approval
   └─ Low confidence → Manual resolution only
5. If approved/auto-approved:
   ├─ Pre-execution safety checks
   ├─ Execute actions via MCP
   ├─ Post-execution validation
   └─ Log all actions
6. User provides feedback
7. System learns and improves
8. Future similar issues → Faster, smarter resolution
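
Step 5 of the enhanced flow (pre-checks, execution, post-validation, logging, and rollback on failure) can be sketched as a small wrapper. This is an illustrative Python outline, not the engine's implementation: the real system executes actions via MCP, whereas here `action`, the checks, and `rollback` are plain callables supplied by the caller.

```python
from typing import Callable, List, Sequence

Check = Callable[[], bool]


def run_with_safety(
    action: Callable[[], None],
    pre_checks: Sequence[Check],
    post_checks: Sequence[Check],
    rollback: Callable[[], None],
    audit_log: List[str],
) -> bool:
    """Run `action` between pre- and post-checks; roll back if execution or validation fails."""
    if not all(check() for check in pre_checks):
        audit_log.append("pre-check failed: action not executed")
        return False
    try:
        action()
        audit_log.append("action executed")
    except Exception as exc:
        audit_log.append(f"action raised {exc!r}")
        rollback()
        audit_log.append("rolled back")
        return False
    if not all(check() for check in post_checks):
        audit_log.append("post-check failed")
        rollback()
        audit_log.append("rolled back")
        return False
    audit_log.append("post-check passed")
    return True
```

Every branch appends to `audit_log`, so a failed run leaves the same kind of trail as a successful one.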

🎯 Use Cases

Use Case 1: Service Down

# Ticket: "Web service not responding"
# Category: server
# Auto-remediation: enabled

AI Analysis:
├─ Identifies: Service crash
├─ Solution: Restart service
├─ Reliability: 92% (based on 15 similar past issues)
├─ Action type: safe_write
└─ Decision: Auto-execute without approval

Result:
├─ Service restarted in 3 seconds
├─ Health check: passed
├─ Action logged
└─ User feedback: ⭐⭐⭐⭐⭐

Future:
└─ Similar issues auto-fixed with 95% confidence

Use Case 2: Storage Full

# Ticket: "Datastore at 98% capacity"
# Category: storage
# Auto-remediation: enabled

AI Analysis:
├─ Identifies: Storage capacity issue
├─ Solution: Expand volume by 100GB
├─ Reliability: 88%
├─ Action type: critical_write (expansion can't be undone easily)
└─ Decision: Require approval

Workflow:
├─ Approval requested from admin
├─ Admin reviews and approves
├─ Pre-check: Backup verified
├─ Volume expanded
├─ Post-check: New space available
└─ Logged with approval trail

Future:
└─ After 10+ successful expansions, may auto-approve

Use Case 3: Network Port Flapping

# Ticket: "Port Gi0/1 flapping on switch"
# Category: network
# Auto-remediation: enabled

AI Analysis:
├─ Identifies: Interface errors causing flapping
├─ Solution: Clear interface errors, bounce port
├─ Reliability: 78% (only 3 similar past issues)
├─ Pattern: Not yet eligible for auto-remediation
└─ Decision: Require approval (not enough history)

After 5+ similar issues with good feedback:
├─ Pattern becomes eligible
└─ Future port issues auto-fixed
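
The decisions in the three use cases above can be reproduced with a compact rule set. The Python sketch below is one possible reading of the documented behavior, not the actual Decision Engine: the `Decision` enum and the `decide` signature are hypothetical, the 85 and 60 thresholds come from the safety guardrails and the confidence table, and the action type strings `safe_write` and `critical_write` are the ones used in the use cases.

```python
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    REQUIRE_APPROVAL = "require_approval"
    MANUAL_ONLY = "manual_only"


def decide(
    score: float,
    action_type: str,
    enabled: bool,
    pattern_eligible: bool = True,
    min_auto_score: float = 85.0,
    min_approval_score: float = 60.0,
) -> Decision:
    """Choose how a proposed remediation should proceed."""
    if not enabled:
        return Decision.MANUAL_ONLY       # auto-remediation is opt-in per ticket
    if score < min_approval_score:
        return Decision.MANUAL_ONLY       # low confidence: never act
    if action_type == "critical_write":
        return Decision.REQUIRE_APPROVAL  # critical actions always need a human
    if score >= min_auto_score and pattern_eligible:
        return Decision.AUTO_EXECUTE
    return Decision.REQUIRE_APPROVAL


print(decide(92, "safe_write", enabled=True))                      # Use Case 1
print(decide(88, "critical_write", enabled=True))                  # Use Case 2
print(decide(78, "safe_write", enabled=True, pattern_eligible=False))  # Use Case 3
```

The action type for Use Case 3 is not stated in the source; `safe_write` is assumed here, and the approval requirement follows from the score and the missing pattern history.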

🔐 Security & Safety

Built-in Safety Features:

✅ Explicit Opt-in: Auto-remediation disabled by default
✅ Action Classification: Safe vs. critical operations
✅ Reliability Thresholds: Minimum 85% for auto-execution
✅ Approval Workflow: Critical actions require human OK
✅ Rate Limiting: Max 10 actions per hour
✅ Pre-execution Checks: Health, backups, time windows
✅ Post-execution Validation: Verify success
✅ Rollback Capability: Undo on failure
✅ Full Audit Trail: Every action logged
✅ Pattern Validation: Only proven patterns get auto-remediation

What AI Will NEVER Do:

❌ Delete data without approval
❌ Modify critical configs without approval
❌ Shutdown production systems without approval
❌ Execute outside the permitted time windows (for example, during business hours when restricted)
❌ Exceed rate limits
❌ Act on low-confidence issues
❌ Proceed if safety checks fail

📈 Expected Benefits

Operational Efficiency

90% reduction in time to resolution for common issues
80% of repetitive issues auto-resolved
<3 seconds average resolution time for known patterns
24/7 automated response even outside business hours

Quality Improvements

Consistent resolutions (fewer manual errors)
Learning from feedback (gets better over time)
Documented audit trail (full transparency)
Proactive pattern recognition

Cost Savings

70-80% reduction in operational overhead for common issues
Faster mean time to resolution (MTTR)
Fewer escalations
Better resource utilization

🚦 Rollout Strategy

Phase 1: Pilot (Week 1-2)

Enable for cache/restart operations only
5% of tickets
Require approval for all
Monitor closely

Phase 2: Expansion (Week 3-4)

Add safe network operations
20% of tickets
Auto-approve if reliability ≥ 95%
Collect feedback aggressively

Phase 3: Scale (Week 5-6)

Enable for all safe operations
50% of tickets
Auto-approve if reliability ≥ 90%
Patterns becoming eligible

Phase 4: Full Deployment (Week 7+)

All categories (except security)
Available for 100% of tickets
Dynamic thresholds based on performance
Continuous improvement
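
One way to encode the rollout phases is as data, with a deterministic hash deciding which tickets fall inside the current phase's share. The Python sketch below assumes this approach; nothing in the source says how tickets are sampled. The Phase 4 threshold is a placeholder set to the documented default of 85.0, since the source only says thresholds become dynamic in that phase.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    ticket_share: float                      # fraction of tickets in scope, 0.0-1.0
    auto_approve_threshold: Optional[float]  # None means approval is always required


PHASES = [
    Phase("pilot", 0.05, None),
    Phase("expansion", 0.20, 95.0),
    Phase("scale", 0.50, 90.0),
    Phase("full", 1.00, 85.0),  # placeholder: thresholds are dynamic in this phase
]


def in_rollout(ticket_id: str, share: float) -> bool:
    """Deterministically place a ticket inside or outside the rollout share."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # uniform value in [0.0, 1.0)
    return bucket < share


def can_auto_approve(phase: Phase, score: float) -> bool:
    """True if the phase allows auto-approval at this reliability score."""
    return phase.auto_approve_threshold is not None and score >= phase.auto_approve_threshold
```

Hashing the ticket ID keeps the decision stable across retries, so the same ticket is never in scope on one request and out of scope on the next.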

📚 Documentation

New documentation added:

AUTO_REMEDIATION_GUIDE.md - Complete guide (THIS FILE)
API_ENHANCED.md - Enhanced API documentation
RELIABILITY_SCORING.md - Deep dive on scoring
FEEDBACK_SYSTEM.md - Feedback loop details
PATTERN_LEARNING.md - How patterns work

🎓 Training & Adoption

For Operators:

Read AUTO_REMEDIATION_GUIDE.md
Start with low-risk categories
Always provide feedback
Monitor logs and analytics
Adjust thresholds based on results

For Administrators:

Configure auto_remediation_policies
Set appropriate thresholds per category
Define approval workflows
Monitor system performance
Review and approve critical actions

For Developers:

Integrate API endpoints
Implement feedback collection
Use reliability scores in decisions
Monitor metrics and alerts
Contribute to pattern improvement

🔄 Migration from v1.0

Breaking Changes:

None! v2.0 is fully backward compatible.

Existing tickets continue to work
Auto-remediation is opt-in
All v1.0 APIs still functional

New Defaults:

enable_auto_remediation: false (explicit opt-in required)
requires_approval: true (by default)
min_reliability_score: 85.0
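
The defaults above could be modeled as an immutable policy object with per-category overrides, in the spirit of the auto_remediation_policies table. This Python sketch is illustrative only: the class name, the override mechanism, and the stricter storage threshold are assumptions, not the project's schema.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class AutoRemediationPolicy:
    """Hypothetical policy object carrying the documented v2.0 defaults."""

    enable_auto_remediation: bool = False  # explicit opt-in required
    requires_approval: bool = True
    min_reliability_score: float = 85.0


DEFAULT_POLICY = AutoRemediationPolicy()

# Illustrative override: a stricter threshold for storage actions.
CATEGORY_OVERRIDES: Dict[str, dict] = {
    "storage": {"min_reliability_score": 90.0},
}


def policy_for(category: str) -> AutoRemediationPolicy:
    """Return the default policy with any overrides for `category` applied."""
    return replace(DEFAULT_POLICY, **CATEGORY_OVERRIDES.get(category, {}))
```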

Database Migration:

# Run Alembic migrations
poetry run alembic upgrade head

# Migrations add new tables:
# - ticket_feedbacks
# - similar_tickets
# - remediation_logs
# - auto_remediation_policies
# - remediation_approvals
# - ticket_patterns

🎉 Summary

v2.0 adds intelligent, safe, self-improving auto-remediation:

✅ AI can now fix problems automatically (disabled by default)
✅ Multi-factor reliability scoring (gets smarter over time)
✅ Human feedback loop (continuous learning)
✅ Pattern recognition (learns from similar issues)
✅ Approval workflow (safety for critical actions)
✅ Full audit trail (complete transparency)
✅ Progressive automation (starts conservative, scales based on success)

The system learns from every interaction and gets better over time!

📞 Support

Email: automation-team@company.local
Slack: #datacenter-automation
Documentation: /docs/auto-remediation
Issues: git.company.local/infrastructure/datacenter-docs/issues

Ready to try auto-remediation? Start with a low-risk ticket and let the AI show you what it can do! 🚀
