LLM Automation System 1ba5ce851d Initial commit: LLM Automation Docs & Remediation Engine v2.0
Features:
- Automated datacenter documentation generation
- MCP integration for device connectivity
- Auto-remediation engine with safety checks
- Multi-factor reliability scoring (0-100%)
- Human feedback learning loop
- Pattern recognition and continuous improvement
- Agentic chat support with AI
- API for ticket resolution
- Frontend React with Material-UI
- CI/CD pipelines (GitLab + Gitea)
- Docker & Kubernetes deployment
- Complete documentation and guides

v2.0 Highlights:
- Auto-remediation with write operations (disabled by default)
- Reliability calculator with 4-factor scoring
- Human feedback system for continuous learning
- Pattern-based progressive automation
- Approval workflow for critical actions
- Full audit trail and rollback capability
2025-10-17 23:47:28 +00:00

# 🤖 Auto-Remediation System - Complete Documentation
## 📋 Table of Contents
1. [Overview](#overview)
2. [Safety First Design](#safety-first-design)
3. [Reliability Scoring System](#reliability-scoring-system)
4. [Human Feedback Loop](#human-feedback-loop)
5. [Decision Engine](#decision-engine)
6. [Auto-Remediation Execution](#auto-remediation-execution)
7. [Pattern Learning](#pattern-learning)
8. [API Usage](#api-usage)
9. [Configuration](#configuration)
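10. [Monitoring & Analytics](#monitoring--analytics)
11. [Best Practices](#best-practices)
12. [Troubleshooting](#troubleshooting)
13. [Future Enhancements](#future-enhancements)
14. [Summary](#summary)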
---
## Overview
The **Auto-Remediation System** enables AI to autonomously resolve infrastructure issues by executing write operations on your systems. This is a **production-grade** implementation with extensive safety checks, human oversight, and continuous learning.
### Key Features
- **Safety-First**: Auto-remediation **disabled by default**
- **Reliability Scoring**: Multi-factor confidence calculation (0-100%)
- **Human Feedback**: Continuous learning from user feedback
- **Pattern Recognition**: Learns from similar issues
- **Approval Workflow**: Critical actions require human approval
- **Full Audit Trail**: Every action logged with rollback capability
- **Progressive Automation**: Decisions improve over time based on success rate
---
## Safety First Design
### 🛡️ Default State: DISABLED
```python
# Example: ticket submission payload
ticket = {
    "ticket_id": "INC-001",
    "description": "Problem description",
    "enable_auto_remediation": False,  # ← DEFAULT: Disabled
}
```
**Auto-remediation must be explicitly enabled for each ticket.**
### Safety Layers
1. **Explicit Enablement**: Must opt in per ticket
2. **Reliability Thresholds**: Minimum confidence required
3. **Action Classification**: Safe vs. Critical operations
4. **Pre-execution Checks**: System health, backups, rate limits
5. **Human Approval**: Required for low-reliability or critical actions
6. **Post-execution Validation**: Verify success
7. **Rollback Capability**: Undo on failure
### Action Classification
```python
import enum


class RemediationAction(str, enum.Enum):
    READ_ONLY = "read_only"            # No changes (default)
    SAFE_WRITE = "safe_write"          # Non-destructive (restart, clear cache)
    CRITICAL_WRITE = "critical_write"  # Potentially destructive (delete, modify)
```
**Critical actions ALWAYS require human approval**, regardless of confidence.
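The approval rule above can be sketched as a small helper. This is a minimal illustration, not the engine's actual code: `needs_human_approval` and its parameters are hypothetical names, the treatment of read-only actions is an assumption, and the 90.0 default mirrors the `auto_approve_threshold` value used in the Configuration section.

```python
import enum


class RemediationAction(str, enum.Enum):
    READ_ONLY = "read_only"
    SAFE_WRITE = "safe_write"
    CRITICAL_WRITE = "critical_write"


def needs_human_approval(
    action: RemediationAction,
    reliability: float,
    auto_approve_threshold: float = 90.0,
) -> bool:
    """Return True when a human must approve before execution (hypothetical helper)."""
    if action is RemediationAction.CRITICAL_WRITE:
        # Critical actions are never auto-approved, whatever the score.
        return True
    if action is RemediationAction.READ_ONLY:
        # Assumption: read-only actions change nothing, so no approval is needed.
        return False
    # Safe writes are auto-approved only at or above the threshold.
    return reliability < auto_approve_threshold
```

Checking the critical case first keeps a high reliability score from ever bypassing the approval requirement.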
---
## Reliability Scoring System
### Multi-Factor Calculation
The reliability score (0-100%) is calculated from **4 components**:
```python
# Each component is expressed on a 0-100 scale.
reliability_score = (
    ai_confidence * 0.25      # Model's own confidence
    + human_feedback * 0.30   # Historical feedback quality
    + success_history * 0.25  # Past resolution success rate
    + pattern_match * 0.20    # Similarity to known patterns
)
```
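As a worked example, the weighted sum can be written as a small function. This is a sketch under stated assumptions: `WEIGHTS` and `reliability_score` are illustrative names, every component is assumed to be on a 0-100 scale already, and rounding to one decimal place is a choice made here. The inputs are the component percentages from the Example Breakdown in this guide.

```python
WEIGHTS = {
    "ai_confidence": 0.25,
    "human_feedback": 0.30,
    "success_history": 0.25,
    "pattern_match": 0.20,
}


def reliability_score(components: dict[str, float]) -> float:
    """Weighted sum of the four components, each given on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = sum(components[name] * weight for name, weight in WEIGHTS.items())
    return round(total, 1)


score = reliability_score(
    {
        "ai_confidence": 92.0,
        "human_feedback": 85.0,
        "success_history": 90.0,
        "pattern_match": 82.0,
    }
)
print(score)  # 87.4
```

With these weights the result is 87.4, slightly below the 87.5 shown in the Example Breakdown. The gap presumably comes from rounding of the displayed component values, though that is an assumption.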
### Component Details
#### 1. AI Confidence (25%)
- Direct from Claude Sonnet 4.5
- Based on documentation quality and analysis certainty
- Range: 0-1 converted to 0-100%
#### 2. Human Feedback (30%)
- Weighted by recency (recent feedback = more weight)
- Considers:
- Positive/Negative/Neutral feedback type
- Star ratings (1-5)
- Resolution accuracy
- Action effectiveness
```python
feedback_score = (
    positive_feedback_rate * 100   # rate in the range 0-1
    + average_rating / 5 * 100     # star rating in the range 1-5
) / 2
```
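The guide does not say how recency weighting is applied, so the following is only one plausible reading: a sketch that applies the formula above with exponentially decaying weights. The `Feedback` class, the 30-day half-life, and the score of 0.0 for an empty history are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    positive: bool   # True for positive feedback
    rating: int      # star rating, 1-5
    age_days: float  # days since the feedback was submitted


def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed decay: the weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def feedback_score(items: list[Feedback]) -> float:
    """Recency-weighted variant of the feedback formula, on a 0-100 scale."""
    if not items:
        return 0.0  # assumption: no feedback yields the lowest score
    weights = [recency_weight(f.age_days) for f in items]
    total = sum(weights)
    positive_rate = sum(w for w, f in zip(weights, items) if f.positive) / total
    average_rating = sum(w * f.rating for w, f in zip(weights, items)) / total
    return (positive_rate * 100 + average_rating / 5 * 100) / 2
```

With this weighting, one fresh five-star review outweighs a one-star review from many months ago, which matches the stated intent that recent feedback counts more.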
#### 3. Historical Success (25%)
- Success rate in same category (last 6 months)
- Formula: `resolved_tickets / total_tickets × 100`
#### 4. Pattern Match (20%)
- Similarity to known, resolved patterns
- Requires ≥3 similar tickets for pattern
- Boosts score if pattern has positive feedback
### Confidence Levels
| Score Range | Level | Description |
|-------------|-----------|-------------|
| 90-100% | Very High | Excellent track record, safe to auto-execute |
| 75-89% | High | Good reliability, may require approval |
| 60-74% | Medium | Moderate confidence, approval recommended |
| 0-59% | Low | Low confidence, manual review required |
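The table can be expressed as a lookup on lower bounds, which also settles fractional scores such as 89.5 that fall between the listed integer ranges. This is a sketch: the function name is hypothetical, and the level strings other than `"high"` (which appears in the example responses) are assumed spellings.

```python
def confidence_level(score: float) -> str:
    """Map a 0-100 reliability score to a confidence level name."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 90.0:
        return "very_high"
    if score >= 75.0:
        return "high"
    if score >= 60.0:
        return "medium"
    return "low"
```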
### Example Breakdown
```json
{
  "overall_score": 87.5,
  "confidence_level": "high",
  "breakdown": {
    "ai_confidence": "92%",
    "human_validation": "85%",
    "success_history": "90%",
    "pattern_recognition": "82%"
  }
}
```
---
## Human Feedback Loop
### Feedback Collection
After each ticket resolution, collect structured feedback:
```python
feedback = {
    "ticket_id": "INC-001",
    "feedback_type": "positive",  # one of: positive, negative, neutral
    "rating": 5,  # 1-5 stars
    "was_helpful": True,
    "resolution_accurate": True,
    "actions_worked": True,
    # Optional detailed feedback
    "comment": "Great resolution!",
    "what_worked": "The restart fixed it",
    "what_didnt_work": None,
    "suggestions": "Could add more details",
    # If AI failed, what actually worked?
    "actual_resolution": "Had to increase memory instead",
    "actual_actions_taken": [...],
    "time_to_resolve": 30.0,  # minutes
}
```
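Before a payload like this is stored, it helps to validate the fields that feed the reliability score. The sketch below is illustrative only: `validate_feedback` is a hypothetical helper, and treating `ticket_id`, `feedback_type`, and `rating` as the required fields is an assumption, not something the API is known to enforce.

```python
VALID_TYPES = {"positive", "negative", "neutral"}
REQUIRED_FIELDS = ("ticket_id", "feedback_type", "rating")  # assumed required set


def validate_feedback(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    if "feedback_type" in payload and payload["feedback_type"] not in VALID_TYPES:
        errors.append("feedback_type must be positive, negative, or neutral")
    if "rating" in payload:
        rating = payload["rating"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(rating, int) and not isinstance(rating, bool)
        if not is_int or not 1 <= rating <= 5:
            errors.append("rating must be an integer from 1 to 5")
    return errors
```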
### Feedback Impact
1. **Immediate**: Updates ticket reliability score
2. **Pattern Learning**: Strengthens/weakens pattern eligibility
3. **Future Decisions**: Influences similar ticket handling
4. **Auto-remediation Eligibility**: Pattern becomes eligible after:
- ≥5 occurrences
- ≥85% positive feedback rate
- ≥85% average reliability score
### Feedback Analytics
Track feedback trends:
- Positive/Negative/Neutral distribution
- Average ratings by category
- Resolution accuracy trends
- Action success rates
---
## Decision Engine
### Decision Flow
```
1. Check: Auto-remediation enabled for ticket?
   ├─ NO → Skip auto-remediation
   └─ YES → Continue
2. Get applicable policy for category
   ├─ No policy → Require manual approval
   └─ Policy exists → Continue
3. Classify action risk level
   ├─ READ_ONLY → Low risk
   ├─ SAFE_WRITE → Medium risk
   └─ CRITICAL_WRITE → High risk
4. Check confidence & reliability thresholds
   ├─ Below minimum → Reject
   └─ Above minimum → Continue
5. Perform safety checks
   ├─ Pre-checks failed → Reject
   └─ All passed → Continue
6. Check pattern eligibility
   ├─ Unknown pattern → Require approval
   └─ Known good pattern → Continue
7. Determine approval requirement
   ├─ Critical action → Require approval (always, regardless of reliability)
   ├─ Reliability ≥ auto_approve_threshold → Auto-approve
   └─ Otherwise → Follow policy
8. Execute or await approval
```
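The flow can be condensed into a single function. The sketch below is a simplified reading, not the engine's implementation: `Policy`, `Decision`, and `decide` are hypothetical names, the safety checks and pattern eligibility are passed in as precomputed booleans, and only the reliability threshold is checked in step 4 (the AI confidence check is omitted for brevity). Critical actions are tested before the auto-approve threshold so that they always require approval.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    min_reliability_score: float = 80.0
    auto_approve_threshold: float = 90.0
    requires_approval: bool = True


@dataclass
class Decision:
    allowed: bool
    requires_approval: bool
    reasoning: list[str] = field(default_factory=list)


def decide(
    enabled: bool,
    policy: Optional[Policy],
    action_type: str,  # "read_only", "safe_write" or "critical_write"
    reliability: float,  # 0-100
    safety_checks_passed: bool,
    pattern_eligible: bool,
) -> Decision:
    # Step 1: auto-remediation is opt-in per ticket.
    if not enabled:
        return Decision(False, False, ["Auto-remediation not enabled for ticket"])
    # Step 2: without a policy, a human has to decide.
    if policy is None:
        return Decision(True, True, ["No policy for category"])
    # Step 4: reject below the policy minimum.
    if reliability < policy.min_reliability_score:
        return Decision(False, False, ["Reliability below minimum"])
    # Step 5: reject when any pre-execution safety check failed.
    if not safety_checks_passed:
        return Decision(False, False, ["Safety checks failed"])
    # Steps 3 and 7: critical actions always need approval.
    if action_type == "critical_write":
        return Decision(True, True, ["Critical action requires approval"])
    # Step 6: unknown or not-yet-eligible patterns need approval.
    if not pattern_eligible:
        return Decision(True, True, ["Pattern not yet eligible"])
    # Step 7: auto-approve at or above the threshold, else follow the policy.
    if reliability >= policy.auto_approve_threshold:
        reason = (
            f"Auto-approved: reliability {reliability:.0f}% "
            f">= {policy.auto_approve_threshold:.0f}%"
        )
        return Decision(True, False, [reason])
    return Decision(True, policy.requires_approval, ["Following policy approval setting"])
```

Returning early at each gate keeps every rejection tied to exactly one reason, which makes the `reasoning` list easy to audit.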
### Decision Example
```json
{
  "allowed": true,
  "action_type": "safe_write",
  "requires_approval": false,
  "reasoning": [
    "All checks passed",
    "Auto-approved: reliability 92% >= 90%"
  ],
  "safety_checks": {
    "time_window_ok": true,
    "rate_limit_ok": true,
    "backup_available": true,
    "system_healthy": true,
    "all_passed": true
  },
  "risk_level": "medium"
}
```
---
## Auto-Remediation Execution
### Execution Flow
```python
async def execute_remediation(ticket, actions, decision):
    # The helpers used here (has_approval, check_system_health, execute_via_mcp,
    # verify_success, rollback, log_remediation) are assumed to be defined
    # elsewhere in the engine; this block shows only the control flow.
    # 1. Verify decision allows execution
    if not decision['allowed']:
        return "error"
    # 2. Check approval if required
    if decision['requires_approval'] and not has_approval(ticket):
        return "awaiting_approval"
    # 3. Execute each action with safety
    for action in actions:
        # Pre-execution check
        pre_check = await check_system_health()
        if not pre_check.passed:
            rollback()  # undo any actions already applied
            return "error"
        # Execute action via MCP
        result = await execute_via_mcp(action)
        # Post-execution verification
        post_check = await verify_success()
        if not post_check.passed:
            rollback()
            return "error"
        # Log action
        log_remediation(action, result)
    return "success"
```
### Supported Operations
#### VMware
- `restart_vm` - Graceful VM restart
- `snapshot_vm` - Create snapshot
- `increase_memory` - Increase VM memory
- `increase_cpu` - Add vCPUs
#### Kubernetes
- `restart_pod` - Delete pod (recreate)
- `scale_deployment` - Change replica count
- `rollback_deployment` - Rollback to previous version
#### Network
- `clear_interface_errors` - Clear interface counters
- `enable_port` - Enable disabled port
- `restart_interface` - Bounce interface
#### Storage
- `expand_volume` - Increase volume size
- `clear_snapshots` - Remove old snapshots
#### OpenStack
- `reboot_instance` - Soft reboot instance
- `resize_instance` - Change instance flavor
### Safety Checks
**Pre-execution:**
- System health check (CPU, memory, disk)
- Backup availability verification
- Rate limit check (max 10/hour)
- Time window check (maintenance hours)
**Post-execution:**
- Resource health verification
- Service availability check
- Performance metrics validation
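The rate limit check can be sketched as a sliding-window counter. This is an illustration under assumptions: the guide only states the limit (max 10/hour), so the rolling-window semantics, the class name `HourlyRateLimiter`, and the injectable `now` parameter (added here to make the logic testable) are all choices made for this example.

```python
import time
from collections import deque
from typing import Optional


class HourlyRateLimiter:
    """Sliding-window limiter: at most `max_actions` per rolling window."""

    def __init__(self, max_actions: int = 10, window_seconds: float = 3600.0):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of accepted actions, oldest first

    def try_acquire(self, now: Optional[float] = None) -> bool:
        """Record an action and return True if it fits within the limit."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have left the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True
```

A rolling window avoids the burst that a fixed hourly counter allows around the top of the hour, at the cost of keeping one timestamp per accepted action.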
### Rollback
If any action fails:
1. Stop execution immediately
2. Log failure details
3. Execute rollback procedures
4. Notify administrators
5. Update ticket status to `partially_remediated`
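The rollback steps above can be sketched with an undo stack: each step carries its own undo callable, and completed steps are undone in reverse order on the first failure. The names `Step` and `run_with_rollback` are hypothetical, administrator notification is left out, and the `"success"` status string is an assumption (only `partially_remediated` comes from this guide).

```python
from typing import Callable

# (name, execute, undo): execute returns True on success.
Step = tuple[str, Callable[[], bool], Callable[[], None]]


def run_with_rollback(steps: list[Step]) -> dict:
    """Run steps in order; on the first failure undo completed steps in reverse."""
    completed: list[Step] = []
    log: list[str] = []
    for step in steps:
        name, execute, _undo = step
        if execute():
            completed.append(step)
            log.append(f"ok: {name}")
            continue
        # Stop immediately, record the failure, then roll back.
        log.append(f"failed: {name}")
        for done_name, _execute, undo in reversed(completed):
            undo()
            log.append(f"rolled back: {done_name}")
        return {"status": "partially_remediated", "log": log}
    return {"status": "success", "log": log}
```

Undoing in reverse order matters when later steps depend on earlier ones, for example a resize that follows a snapshot.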
---
## Pattern Learning
### Pattern Identification
```python
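import hashlib

# Generate pattern signature
category = 'network'
key_terms = ['vlan', 'connectivity', 'timeout']
# Assumed signature format: category plus sorted key terms
signature = f"{category}:{','.join(sorted(key_terms))}"
pattern = {
    'category': category,
    'key_terms': key_terms,
    'hash': hashlib.sha256(signature.encode('utf-8')).hexdigest(),
}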
```
### Pattern Statistics
Tracked for each pattern:
- **Occurrence count**: How many times seen
- **Success/failure counts**: Resolution outcomes
- **Feedback distribution**: Positive/negative/neutral
- **Average confidence**: Mean AI confidence
- **Average reliability**: Mean reliability score
- **Auto-remediation success rate**: % of successful auto-fixes
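A few of these statistics can be kept as running totals so that rates are derived on demand rather than stored. The sketch below covers only a subset of the tracked values; `PatternStats` and its `record` method are hypothetical names, not the system's actual model.

```python
from dataclasses import dataclass


@dataclass
class PatternStats:
    occurrence_count: int = 0
    success_count: int = 0
    positive_feedback_count: int = 0
    reliability_sum: float = 0.0

    def record(self, resolved: bool, positive_feedback: bool, reliability: float) -> None:
        """Fold one ticket outcome into the running totals."""
        self.occurrence_count += 1
        self.success_count += int(resolved)
        self.positive_feedback_count += int(positive_feedback)
        self.reliability_sum += reliability

    @property
    def positive_feedback_rate(self) -> float:
        """Share of occurrences with positive feedback, in the range 0-1."""
        if self.occurrence_count == 0:
            return 0.0
        return self.positive_feedback_count / self.occurrence_count

    @property
    def avg_reliability_score(self) -> float:
        """Mean reliability score across occurrences, on a 0-100 scale."""
        if self.occurrence_count == 0:
            return 0.0
        return self.reliability_sum / self.occurrence_count
```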
### Pattern Eligibility
Pattern becomes eligible for auto-remediation when:
```python
if (
    pattern.occurrence_count >= 5 and
    pattern.positive_feedback_rate >= 0.85 and
    pattern.avg_reliability_score >= 85.0 and
    pattern.auto_remediation_success_rate >= 0.85
):
    pattern.eligible_for_auto_remediation = True
```
### Pattern Evolution
```
Initial State:
├─ occurrence_count: 1
├─ eligible_for_auto_remediation: false
└─ Manual resolution only

After 5+ occurrences with good feedback:
├─ occurrence_count: 7
├─ positive_feedback_rate: 0.85
├─ avg_reliability_score: 87.0
├─ eligible_for_auto_remediation: true
└─ Can trigger auto-remediation

After 20+ occurrences:
├─ occurrence_count: 24
├─ auto_remediation_success_rate: 0.92
├─ Very high confidence
└─ Auto-remediation without approval
```
---
## API Usage
### Create Ticket with Auto-Remediation
```bash
curl -X POST http://localhost:8000/api/v1/tickets \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "title": "Service down",
    "description": "Web service not responding on port 8080",
    "category": "server",
    "enable_auto_remediation": true
  }'
```
**Response:**
```json
{
  "ticket_id": "INC-12345",
  "status": "processing",
  "auto_remediation_enabled": true,
  "confidence_score": 0.0,
  "reliability_score": null
}
```
### Check Ticket Status
```bash
curl http://localhost:8000/api/v1/tickets/INC-12345
```
**Response:**
```json
{
  "ticket_id": "INC-12345",
  "status": "resolved",
  "resolution": "Service was restarted successfully...",
  "suggested_actions": [
    {"action": "Restart web service", "system": "prod-web-01"}
  ],
  "confidence_score": 0.92,
  "reliability_score": 87.5,
  "reliability_breakdown": {
    "overall_score": 87.5,
    "confidence_level": "high",
    "breakdown": {...}
  },
  "auto_remediation_enabled": true,
  "auto_remediation_executed": true,
  "remediation_decision": {
    "allowed": true,
    "requires_approval": false,
    "action_type": "safe_write"
  },
  "remediation_results": {
    "success": true,
    "executed_actions": [...]
  }
}
```
### Submit Feedback
```bash
curl -X POST http://localhost:8000/api/v1/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "feedback_type": "positive",
    "rating": 5,
    "was_helpful": true,
    "resolution_accurate": true,
    "actions_worked": true,
    "comment": "Perfect resolution, service is back up!"
  }'
```
### Approve Remediation
For tickets requiring approval:
```bash
curl -X POST http://localhost:8000/api/v1/tickets/INC-12345/approve-remediation \
  -H "Content-Type: application/json" \
  -d '{
    "ticket_id": "INC-12345",
    "approve": true,
    "approver": "john.doe@company.com",
    "comment": "Approved for execution"
  }'
```
### Get Analytics
```bash
# Quote the URLs so the shell does not interpret "?" and "&"
# Reliability statistics
curl "http://localhost:8000/api/v1/stats/reliability?days=30"
# Auto-remediation statistics
curl "http://localhost:8000/api/v1/stats/auto-remediation?days=30"
# Learned patterns
curl "http://localhost:8000/api/v1/patterns?category=network&min_occurrences=5"
```
---
## Configuration
### Auto-Remediation Policy
```python
policy = AutoRemediationPolicy(
    name="network-auto-remediation",
    category="network",
    # Thresholds
    min_confidence_score=0.85,       # 85% AI confidence required
    min_reliability_score=80.0,      # 80% reliability required
    min_similar_tickets=5,           # Need 5+ similar resolved tickets
    min_positive_feedback_rate=0.8,  # 80% positive feedback required
    # Allowed actions
    allowed_action_types=["safe_write"],
    allowed_systems=["network"],
    forbidden_commands=["delete", "format", "shutdown"],
    # Time restrictions
    allowed_hours_start=22,  # 10 PM
    allowed_hours_end=6,     # 6 AM
    allowed_days=["monday", "tuesday", "wednesday", "thursday", "friday"],
    # Approval
    requires_approval=True,
    auto_approve_threshold=90.0,  # Auto-approve if reliability ≥ 90%
    approvers=["admin@company.com"],
    # Safety
    max_actions_per_hour=10,
    requires_rollback_plan=True,
    requires_backup=True,
    # Status
    enabled=True
)
```
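The time restriction in this policy runs from 22:00 to 06:00, so the window wraps past midnight and a plain `start <= hour < end` comparison would never match. A minimal sketch of a check that handles the wrap follows. `in_allowed_window` is a hypothetical helper, and evaluating the allowed day against the calendar day of the timestamp (so that Saturday 02:00 is rejected even though the window opened on Friday night) is an assumption.

```python
from datetime import datetime

DAY_NAMES = ("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday")


def in_allowed_window(
    when: datetime,
    start_hour: int = 22,
    end_hour: int = 6,
    allowed_days: tuple[str, ...] = DAY_NAMES[:5],
) -> bool:
    """Return True if `when` falls inside the allowed hours on an allowed day."""
    day_ok = DAY_NAMES[when.weekday()] in allowed_days  # Monday is weekday 0
    if start_hour <= end_hour:
        hour_ok = start_hour <= when.hour < end_hour
    else:
        # Window wraps past midnight, e.g. 22:00 to 06:00.
        hour_ok = when.hour >= start_hour or when.hour < end_hour
    return day_ok and hour_ok
```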
### Environment Variables
```bash
# Enable/disable auto-remediation globally
AUTO_REMEDIATION_ENABLED=true
# Global safety settings
AUTO_REMEDIATION_MAX_ACTIONS_PER_HOUR=10
AUTO_REMEDIATION_REQUIRE_APPROVAL=true
AUTO_REMEDIATION_MIN_RELIABILITY=85.0
# Pattern learning
PATTERN_MIN_OCCURRENCES=5
PATTERN_MIN_POSITIVE_RATE=0.85
```
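These variables could be read into a typed settings object at startup. The sketch below is one way to do that, not the project's actual loader: `Settings` and `load_settings` are hypothetical names, and the fallback defaults are assumptions chosen to err on the safe side (auto-remediation off, approval required) when a variable is unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    enabled: bool
    max_actions_per_hour: int
    require_approval: bool
    min_reliability: float
    pattern_min_occurrences: int
    pattern_min_positive_rate: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the auto-remediation variables, using assumed safe defaults."""
    return Settings(
        enabled=_as_bool(env.get("AUTO_REMEDIATION_ENABLED", "false")),
        max_actions_per_hour=int(env.get("AUTO_REMEDIATION_MAX_ACTIONS_PER_HOUR", "10")),
        require_approval=_as_bool(env.get("AUTO_REMEDIATION_REQUIRE_APPROVAL", "true")),
        min_reliability=float(env.get("AUTO_REMEDIATION_MIN_RELIABILITY", "85.0")),
        pattern_min_occurrences=int(env.get("PATTERN_MIN_OCCURRENCES", "5")),
        pattern_min_positive_rate=float(env.get("PATTERN_MIN_POSITIVE_RATE", "0.85")),
    )
```

Passing the environment as a mapping makes the loader easy to exercise in tests without touching the real process environment.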
---
## Monitoring & Analytics
### Key Metrics
```
# Reliability metrics
- avg_reliability_score: Average across all tickets
- avg_confidence_score: Average AI confidence
- resolution_rate: % of tickets resolved
# Auto-remediation metrics
- execution_rate: % of enabled tickets that were auto-remediated
- success_rate: % of auto-remediation actions that succeeded
- approval_rate: % requiring human approval
# Feedback metrics
- positive_feedback_rate: % positive feedback
- negative_feedback_rate: % negative feedback
- avg_rating: Average star rating (1-5)
# Pattern metrics
- eligible_patterns: # of patterns eligible for auto-remediation
- pattern_success_rate: Success rate across all patterns
```
### Grafana Dashboards
Example metrics:
```promql
# Reliability score trend
avg(datacenter_docs_reliability_score) by (category)
# Auto-remediation success rate
rate(datacenter_docs_auto_remediation_success_total[1h]) /
rate(datacenter_docs_auto_remediation_attempts_total[1h])
# Feedback sentiment
sum(datacenter_docs_feedback_total) by (type)
```
### Alerts
```yaml
# Low reliability alert
- alert: LowReliabilityScore
  expr: avg(datacenter_docs_reliability_score) < 70
  for: 1h
  annotations:
    summary: "Reliability score below threshold"
# High failure rate (failures as a fraction of attempts)
- alert: HighAutoRemediationFailureRate
  expr: rate(datacenter_docs_auto_remediation_failures_total[1h]) / rate(datacenter_docs_auto_remediation_attempts_total[1h]) > 0.2
  for: 15m
  annotations:
    summary: "Auto-remediation failure rate > 20%"
```
---
## Best Practices
### 1. Start Conservative
- Enable auto-remediation for **low-risk categories** first (e.g., cache clearing)
- Set high thresholds initially (reliability ≥ 90%)
- Require approvals for first 20-30 occurrences
- Monitor closely and adjust based on results
### 2. Gradual Rollout
```
Week 1-2: Enable for 5% of tickets
Week 3-4: Increase to 20% if success rate > 90%
Week 5-6: Increase to 50% if success rate > 85%
Week 7+: Full rollout with dynamic thresholds
```
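The guide does not say how tickets are selected for each rollout stage. One common approach, shown here purely as an assumption, is to hash the ticket ID into one of 100 buckets so that selection is deterministic and a ticket included at 5% stays included at 20% and 50%. `in_rollout` is a hypothetical helper.

```python
import hashlib


def in_rollout(ticket_id: str, percent: int) -> bool:
    """Deterministically place a ticket in the first `percent` of 100 buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the ticket ID, raising the percentage only ever adds tickets to the enabled set.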
### 3. Category-Specific Policies
Different categories need different thresholds:
| Category | Min Reliability | Auto-Approve | Reason |
|----------|----------------|--------------|--------|
| Cache | 75% | 85% | Low risk, frequent |
| Network | 85% | 90% | Medium risk |
| Storage | 90% | 95% | High risk |
| Security | 95% | Never | Critical, always approve |
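The table can be encoded as a lookup that yields one of three outcomes for a given score. This is a sketch with hypothetical names (`CATEGORY_THRESHOLDS`, `approval_mode`). Sending unknown categories to a human is an assumption, chosen to match the decision flow's handling of a missing policy.

```python
from typing import Optional

# category -> (min_reliability, auto_approve_threshold); None means never auto-approve.
CATEGORY_THRESHOLDS: dict[str, tuple[float, Optional[float]]] = {
    "cache": (75.0, 85.0),
    "network": (85.0, 90.0),
    "storage": (90.0, 95.0),
    "security": (95.0, None),
}


def approval_mode(category: str, reliability: float) -> str:
    """Return "reject", "needs_approval" or "auto_approve" for a score."""
    thresholds = CATEGORY_THRESHOLDS.get(category)
    if thresholds is None:
        return "needs_approval"  # assumption: unknown categories go to a human
    min_reliability, auto_approve = thresholds
    if reliability < min_reliability:
        return "reject"
    if auto_approve is not None and reliability >= auto_approve:
        return "auto_approve"
    return "needs_approval"
```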
### 4. Human in the Loop
- Always collect feedback, even for successful auto-remediations
- Review logs weekly
- Adjust thresholds based on feedback trends
- Disable patterns with declining success rates
### 5. Continuous Learning
- System improves over time through feedback
- Patterns with 20+ occurrences and 90%+ success → Very high confidence
- Allow system to become more autonomous as reliability proves out
- But maintain human oversight for critical operations
---
## Troubleshooting
### Auto-remediation not executing
**Check:**
1. Is `enable_auto_remediation: true` in ticket?
2. Is there an active policy for the category?
3. Does confidence/reliability meet thresholds?
4. Are safety checks passing?
5. Does pattern meet eligibility requirements?
**Debug:**
```bash
# Check decision
curl http://localhost:8000/api/v1/tickets/TICKET-ID | jq '.remediation_decision'
# Check logs
curl http://localhost:8000/api/v1/tickets/TICKET-ID/remediation-logs
```
### Low reliability scores
**Causes:**
- Insufficient historical data
- Negative feedback on category
- Low pattern match confidence
- Recent failures in category
**Solutions:**
- Collect more feedback
- Review and improve resolutions
- Wait for more data points
- Manually resolve similar tickets successfully
### Pattern not becoming eligible
**Requirements not met:**
- Need ≥5 occurrences
- Need ≥85% positive feedback
- Need ≥85% average reliability
**Action:**
- Continue resolving similar tickets
- Ensure feedback is being collected
- Check pattern stats: `GET /api/v1/patterns`
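When reading pattern stats, a small helper can report which of the three requirements listed above are still unmet. `unmet_requirements` is a hypothetical function and takes the values as plain arguments rather than assuming any particular shape for the `/api/v1/patterns` response.

```python
def unmet_requirements(
    occurrence_count: int,
    positive_feedback_rate: float,  # 0-1
    avg_reliability_score: float,  # 0-100
) -> list[str]:
    """List which of the three listed requirements a pattern still misses."""
    unmet = []
    if occurrence_count < 5:
        unmet.append(f"occurrences {occurrence_count} < 5")
    if positive_feedback_rate < 0.85:
        unmet.append(f"positive feedback {positive_feedback_rate:.0%} < 85%")
    if avg_reliability_score < 85.0:
        unmet.append(f"average reliability {avg_reliability_score:.1f} < 85.0")
    return unmet
```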
---
## Future Enhancements
- **Multi-step reasoning**: Complex workflows spanning multiple systems
- **Predictive remediation**: Fix issues before they cause incidents
- **A/B testing**: Compare different resolution strategies
- **Reinforcement learning**: Optimize actions based on outcomes
- **Natural language explanations**: Better transparency in decisions
- **Cross-system orchestration**: Coordinated actions across infrastructure
---
## Summary
The **Auto-Remediation System** is designed for **safe, gradual automation** of infrastructure issue resolution:
1. **Disabled by default** - explicit opt-in per ticket
2. **Multi-factor reliability** - comprehensive confidence calculation
3. **Human feedback loop** - continuous learning and improvement
4. **Pattern recognition** - learns from similar issues
5. **Safety first** - extensive checks, approval workflows, rollback
6. **Progressive automation** - system becomes more autonomous over time
7. **Full observability** - complete audit trail and analytics
**Start small, monitor closely, scale gradually, and let the system learn.**
---
For support: automation-team@company.local