🤖 Auto-Remediation System - Complete Documentation
📋 Table of Contents
- Overview
- Safety First Design
- Reliability Scoring System
- Human Feedback Loop
- Decision Engine
- Auto-Remediation Execution
- Pattern Learning
- API Usage
- Configuration
- Monitoring & Analytics
Overview
The Auto-Remediation System lets the AI resolve infrastructure issues autonomously by executing write operations on your systems. Every write is gated by safety checks and human oversight, and the outcome of each resolution feeds a continuous learning loop.
Key Features
✅ Safety-First: Auto-remediation disabled by default
✅ Reliability Scoring: Multi-factor confidence calculation (0-100%)
✅ Human Feedback: Continuous learning from user feedback
✅ Pattern Recognition: Learns from similar issues
✅ Approval Workflow: Critical actions require human approval
✅ Full Audit Trail: Every action logged with rollback capability
✅ Progressive Automation: Decisions improve over time based on success rate
Safety First Design
🛡️ Default State: DISABLED
# Example: Ticket submission
{
"ticket_id": "INC-001",
"description": "Problem description",
"enable_auto_remediation": false # ← DEFAULT: Disabled
}
Auto-remediation must be explicitly enabled for each ticket.
Safety Layers
- Explicit Enablement: Must opt-in per ticket
- Reliability Thresholds: Minimum confidence required
- Action Classification: Safe vs. Critical operations
- Pre-execution Checks: System health, backups, rate limits
- Human Approval: Required for low-reliability or critical actions
- Post-execution Validation: Verify success
- Rollback Capability: Undo on failure
Action Classification
import enum

class RemediationAction(str, enum.Enum):
    READ_ONLY = "read_only"            # No changes (default)
    SAFE_WRITE = "safe_write"          # Non-destructive (restart, clear cache)
    CRITICAL_WRITE = "critical_write"  # Potentially destructive (delete, modify)
Critical actions ALWAYS require human approval, regardless of confidence.
Reliability Scoring System
Multi-Factor Calculation
The reliability score (0-100%) is calculated from 4 components:
Reliability Score = (
AI Confidence × 25% + # Model's own confidence
Human Feedback × 30% + # Historical feedback quality
Success History × 25% + # Past resolution success rate
Pattern Match × 20% # Similarity to known patterns
)
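The weighted sum above can be sketched in a few lines of Python. The weights come from the formula; the function name, the dictionary keys, and the clamping step are illustrative assumptions, not the project's actual implementation.

```python
# Weights taken from the formula above; names are hypothetical.
WEIGHTS = {
    "ai_confidence": 0.25,
    "human_feedback": 0.30,
    "success_history": 0.25,
    "pattern_match": 0.20,
}


def reliability_score(components: dict) -> float:
    """Combine four component scores (each on a 0-100 scale) into one score."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    # Clamp so that out-of-range inputs cannot push the result outside 0-100.
    return max(0.0, min(100.0, total))


score = reliability_score(
    {
        "ai_confidence": 90.0,
        "human_feedback": 80.0,
        "success_history": 85.0,
        "pattern_match": 75.0,
    }
)
print(f"reliability: {score:.2f}%")
```

Because the weights sum to 1.0, four components at 100 yield exactly 100, and the human feedback component moves the result more than any other single factor.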
Component Details
1. AI Confidence (25%)
- Direct from Claude Sonnet 4.5
- Based on documentation quality and analysis certainty
- Range: 0-1 converted to 0-100%
2. Human Feedback (30%)
- Weighted by recency (recent feedback = more weight)
- Considers:
- Positive/Negative/Neutral feedback type
- Star ratings (1-5)
- Resolution accuracy
- Action effectiveness
feedback_score = (
positive_feedback_rate × 100 +
average_rating / 5 × 100
) / 2
3. Historical Success (25%)
- Success rate in same category (last 6 months)
- Formula:
resolved_tickets / total_tickets × 100
4. Pattern Match (20%)
- Similarity to known, resolved patterns
- Requires ≥3 similar tickets for pattern
- Boosts score if pattern has positive feedback
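As a rough illustration, the feedback and historical-success formulas above translate directly into Python. The function names and the handling of empty history are assumptions, and the recency weighting mentioned for human feedback is deliberately left out of this sketch.

```python
def feedback_score(positive_feedback_rate: float, average_rating: float) -> float:
    """Average the positive-feedback rate and the star rating on a 0-100 scale."""
    if not 0.0 <= positive_feedback_rate <= 1.0:
        raise ValueError("positive_feedback_rate must be between 0 and 1")
    if not 1.0 <= average_rating <= 5.0:
        raise ValueError("average_rating must be between 1 and 5")
    return (positive_feedback_rate * 100 + average_rating / 5 * 100) / 2


def success_rate(resolved_tickets: int, total_tickets: int) -> float:
    """Share of resolved tickets in a category, as a percentage."""
    if total_tickets == 0:
        # Assumption: with no history, the component contributes nothing.
        return 0.0
    return resolved_tickets / total_tickets * 100


print(feedback_score(0.8, 4.5))
print(success_rate(18, 20))
```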
Confidence Levels
| Score Range | Level | Description |
|---|---|---|
| 90-100% | Very High | Excellent track record, safe to auto-execute |
| 75-89% | High | Good reliability, may require approval |
| 60-74% | Medium | Moderate confidence, approval recommended |
| 0-59% | Low | Low confidence, manual review required |
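The table can be read as a simple threshold lookup. The sketch below assumes lowercase level identifiers (only "high" appears elsewhere in this document; the others are guesses) and treats fractional scores such as 89.5 as belonging to the lower band.

```python
def confidence_level(score: float) -> str:
    """Map a 0-100 reliability score to the level names in the table above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "very_high"
    if score >= 75:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


for sample in (95, 87.5, 60, 42):
    print(sample, confidence_level(sample))
```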
Example Breakdown
{
"overall_score": 87.5,
"confidence_level": "high",
"breakdown": {
"ai_confidence": "92%",
"human_validation": "85%",
"success_history": "90%",
"pattern_recognition": "82%"
}
}
Human Feedback Loop
Feedback Collection
After each ticket resolution, collect structured feedback:
{
"ticket_id": "INC-001",
"feedback_type": "positive|negative|neutral",
"rating": 5, # 1-5 stars
"was_helpful": true,
"resolution_accurate": true,
"actions_worked": true,
# Optional detailed feedback
"comment": "Great resolution!",
"what_worked": "The restart fixed it",
"what_didnt_work": null,
"suggestions": "Could add more details",
# If AI failed, what actually worked?
"actual_resolution": "Had to increase memory instead",
"actual_actions_taken": [...],
"time_to_resolve": 30.0 # minutes
}
Feedback Impact
- Immediate: Updates ticket reliability score
- Pattern Learning: Strengthens/weakens pattern eligibility
- Future Decisions: Influences similar ticket handling
- Auto-remediation Eligibility: Pattern becomes eligible after:
- ≥5 occurrences
- ≥85% positive feedback rate
- ≥85% average reliability score
Feedback Analytics
Track feedback trends:
- Positive/Negative/Neutral distribution
- Average ratings by category
- Resolution accuracy trends
- Action success rates
Decision Engine
Decision Flow
1. Check: Auto-remediation enabled for ticket?
├─ NO → Skip auto-remediation
└─ YES → Continue
2. Get applicable policy for category
├─ No policy → Require manual approval
└─ Policy exists → Continue
3. Classify action risk level
├─ READ_ONLY → Low risk
├─ SAFE_WRITE → Medium risk
└─ CRITICAL_WRITE → High risk
4. Check confidence & reliability thresholds
├─ Below minimum → Reject
└─ Above minimum → Continue
5. Perform safety checks
├─ Pre-checks failed → Reject
└─ All passed → Continue
6. Check pattern eligibility
├─ Unknown pattern → Require approval
└─ Known good pattern → Continue
7. Determine approval requirement
├─ Critical action → Require approval (always, regardless of reliability)
├─ Reliability ≥ auto_approve_threshold → Auto-approve
└─ Otherwise → Follow policy
8. Execute or await approval
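The flow above can be approximated by a single function that returns early at each gate. This is a minimal sketch under stated assumptions: the `Policy` and `Decision` classes, the `decide` function, and its flat argument list are hypothetical, and safety checks and pattern eligibility are passed in as precomputed booleans. Critical actions are tested before the auto-approve threshold so that they always require approval.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    # Hypothetical subset of the policy fields shown under Configuration.
    min_reliability_score: float = 80.0
    auto_approve_threshold: float = 90.0
    requires_approval: bool = True


@dataclass
class Decision:
    allowed: bool
    requires_approval: bool
    reasoning: list = field(default_factory=list)


def decide(enabled: bool, policy: Optional[Policy], action_type: str,
           reliability: float, safety_ok: bool, pattern_eligible: bool) -> Decision:
    if not enabled:  # step 1
        return Decision(False, False, ["Auto-remediation not enabled for ticket"])
    if policy is None:  # step 2
        return Decision(True, True, ["No policy for category: approval required"])
    if reliability < policy.min_reliability_score:  # step 4
        return Decision(False, False, ["Reliability below policy minimum"])
    if not safety_ok:  # step 5
        return Decision(False, False, ["Pre-execution safety checks failed"])
    if action_type == "critical_write":  # step 3 + 7: critical always needs approval
        return Decision(True, True, ["Critical action: approval always required"])
    if not pattern_eligible:  # step 6
        return Decision(True, True, ["Unknown pattern: approval required"])
    if reliability >= policy.auto_approve_threshold:  # step 7
        return Decision(True, False, [
            f"Auto-approved: reliability {reliability:.0f}% >= "
            f"{policy.auto_approve_threshold:.0f}%"
        ])
    return Decision(True, policy.requires_approval, ["Following policy approval setting"])


print(decide(True, Policy(), "safe_write", 92.0, True, True))
```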
Decision Example
{
"allowed": true,
"action_type": "safe_write",
"requires_approval": false,
"reasoning": [
"All checks passed",
"Auto-approved: reliability 92% >= 90%"
],
"safety_checks": {
"time_window_ok": true,
"rate_limit_ok": true,
"backup_available": true,
"system_healthy": true,
"all_passed": true
},
"risk_level": "medium"
}
Auto-Remediation Execution
Execution Flow
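# Helper functions (has_approval, check_system_health, execute_via_mcp,
# verify_success, rollback, log_remediation) are assumed to be defined elsewhere.
async def execute_remediation(ticket, actions, decision):
    # 1. Verify the decision allows execution
    if not decision['allowed']:
        return {"status": "error", "reason": "remediation not allowed"}

    # 2. Check approval if required
    if decision['requires_approval'] and not has_approval(ticket):
        return {"status": "awaiting_approval"}

    # 3. Execute each action with safety checks
    for action in actions:
        # Pre-execution check
        pre_check = await check_system_health()
        if not pre_check.passed:
            rollback()
            return {"status": "error", "reason": "pre-execution check failed"}

        # Execute action via MCP
        result = await execute_via_mcp(action)

        # Post-execution verification
        post_check = await verify_success()
        if not post_check.passed:
            rollback()
            return {"status": "error", "reason": "post-execution check failed"}

        # Log action
        log_remediation(action, result)

    return {"status": "success"}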
Supported Operations
VMware
- restart_vm: Graceful VM restart
- snapshot_vm: Create snapshot
- increase_memory: Increase VM memory
- increase_cpu: Add vCPUs
Kubernetes
- restart_pod: Delete pod (recreate)
- scale_deployment: Change replica count
- rollback_deployment: Roll back to previous version
Network
- clear_interface_errors: Clear interface counters
- enable_port: Enable disabled port
- restart_interface: Bounce interface
Storage
- expand_volume: Increase volume size
- clear_snapshots: Remove old snapshots
OpenStack
- reboot_instance: Soft reboot instance
- resize_instance: Change instance flavor
Safety Checks
Pre-execution:
- System health check (CPU, memory, disk)
- Backup availability verification
- Rate limit check (max 10/hour)
- Time window check (maintenance hours)
Post-execution:
- Resource health verification
- Service availability check
- Performance metrics validation
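One plausible way to enforce the rate limit check (max 10/hour) is a sliding window over recent action timestamps. The class below is a sketch, not the system's actual limiter; its name and the injectable `now` parameter (which makes it testable without waiting) are assumptions.

```python
import time
from collections import deque
from typing import Optional


class ActionRateLimiter:
    """Allow at most max_actions within any rolling window_seconds period."""

    def __init__(self, max_actions: int = 10, window_seconds: float = 3600.0):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False
        self._timestamps.append(now)
        return True


limiter = ActionRateLimiter(max_actions=10, window_seconds=3600.0)
results = [limiter.allow(now=float(i)) for i in range(11)]
print(results.count(True), "allowed,", results.count(False), "rejected")
```

A rejected call is not recorded, so repeated attempts while the limit is reached do not extend the lockout.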
Rollback
If any action fails:
- Stop execution immediately
- Log failure details
- Execute rollback procedures
- Notify administrators
- Update ticket status to partially_remediated
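The rollback steps above might look like the following sketch, in which each action carries its own undo callable and completed actions are undone in reverse order. The `Step` tuple and `run_with_rollback` are hypothetical names, the result keys are illustrative, and logging and administrator notification are omitted.

```python
from typing import Callable, List, Tuple

# (name, do, undo): every step must supply a way to reverse itself.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> dict:
    completed = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            # Stop immediately and undo completed steps, most recent first.
            rolled_back = []
            for done_name, done_undo in reversed(completed):
                done_undo()
                rolled_back.append(done_name)
            return {
                "status": "partially_remediated",
                "failed_step": name,
                "error": str(exc),
                "rolled_back": rolled_back,
            }
        completed.append((name, undo))
    return {"status": "success", "executed": [name for name, _ in completed]}


state = {"replicas": 1}


def failing_restart() -> None:
    raise RuntimeError("interface did not come back up")


outcome = run_with_rollback([
    ("scale_deployment",
     lambda: state.update(replicas=3),
     lambda: state.update(replicas=1)),
    ("restart_interface", failing_restart, lambda: None),
])
print(outcome["status"], state)
```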
Pattern Learning
Pattern Identification
# Generate pattern signature
pattern = {
    'category': 'network',
    'key_terms': ['vlan', 'connectivity', 'timeout'],
    'hash': sha256(signature)
}
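The snippet above does not say how `signature` is built. One possible approach, shown here purely as an assumption, is to normalize and sort the key terms so that the same issue always hashes to the same value regardless of term order or casing.

```python
import hashlib


def pattern_signature(category: str, key_terms: list) -> dict:
    """Build a stable pattern record; the signature layout is hypothetical."""
    normalized = sorted({term.strip().lower() for term in key_terms})
    signature = f"{category.strip().lower()}|{','.join(normalized)}"
    return {
        "category": category,
        "key_terms": normalized,
        "hash": hashlib.sha256(signature.encode("utf-8")).hexdigest(),
    }


a = pattern_signature("network", ["vlan", "connectivity", "timeout"])
b = pattern_signature("network", ["Timeout", "VLAN", "connectivity"])
print(a["hash"] == b["hash"])
```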
Pattern Statistics
Tracked for each pattern:
- Occurrence count: How many times seen
- Success/failure counts: Resolution outcomes
- Feedback distribution: Positive/negative/neutral
- Average confidence: Mean AI confidence
- Average reliability: Mean reliability score
- Auto-remediation success rate: % of successful auto-fixes
Pattern Eligibility
Pattern becomes eligible for auto-remediation when:
if (
    pattern.occurrence_count >= 5 and
    pattern.positive_feedback_rate >= 0.85 and
    pattern.avg_reliability_score >= 85.0 and
    pattern.auto_remediation_success_rate >= 0.85
):
    pattern.eligible_for_auto_remediation = True
Pattern Evolution
Initial State:
├─ occurrence_count: 1
├─ eligible_for_auto_remediation: false
└─ Manual resolution only
After 5+ occurrences with good feedback:
├─ occurrence_count: 7
├─ positive_feedback_rate: 0.85
├─ avg_reliability_score: 87.0
├─ eligible_for_auto_remediation: true
└─ Can trigger auto-remediation
After 20+ occurrences:
├─ occurrence_count: 24
├─ auto_remediation_success_rate: 0.92
├─ Very high confidence
└─ Auto-remediation without approval
API Usage
Create Ticket with Auto-Remediation
curl -X POST http://localhost:8000/api/v1/tickets \
-H "Content-Type: application/json" \
-d '{
"ticket_id": "INC-12345",
"title": "Service down",
"description": "Web service not responding on port 8080",
"category": "server",
"enable_auto_remediation": true
}'
Response:
{
"ticket_id": "INC-12345",
"status": "processing",
"auto_remediation_enabled": true,
"confidence_score": 0.0,
"reliability_score": null
}
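For callers who prefer Python over curl, the same request can be built with the standard library. This is a sketch: the host, port, and path are copied from the curl example, while the helper names are assumptions. The opt-in flag defaults to False to match the disabled-by-default design.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"  # taken from the curl examples


def build_ticket_request(ticket_id: str, title: str, description: str,
                         category: str,
                         enable_auto_remediation: bool = False) -> urllib.request.Request:
    payload = {
        "ticket_id": ticket_id,
        "title": title,
        "description": description,
        "category": category,
        "enable_auto_remediation": enable_auto_remediation,
    }
    return urllib.request.Request(
        f"{BASE_URL}/tickets",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


req = build_ticket_request(
    "INC-12345", "Service down",
    "Web service not responding on port 8080", "server",
    enable_auto_remediation=True,
)
print(req.get_method(), req.full_url)
```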
Check Ticket Status
curl http://localhost:8000/api/v1/tickets/INC-12345
Response:
{
"ticket_id": "INC-12345",
"status": "resolved",
"resolution": "Service was restarted successfully...",
"suggested_actions": [
{"action": "Restart web service", "system": "prod-web-01"}
],
"confidence_score": 0.92,
"reliability_score": 87.5,
"reliability_breakdown": {
"overall_score": 87.5,
"confidence_level": "high",
"breakdown": {...}
},
"auto_remediation_enabled": true,
"auto_remediation_executed": true,
"remediation_decision": {
"allowed": true,
"requires_approval": false,
"action_type": "safe_write"
},
"remediation_results": {
"success": true,
"executed_actions": [...]
}
}
Submit Feedback
curl -X POST http://localhost:8000/api/v1/feedback \
-H "Content-Type: application/json" \
-d '{
"ticket_id": "INC-12345",
"feedback_type": "positive",
"rating": 5,
"was_helpful": true,
"resolution_accurate": true,
"actions_worked": true,
"comment": "Perfect resolution, service is back up!"
}'
Approve Remediation
For tickets requiring approval:
curl -X POST http://localhost:8000/api/v1/tickets/INC-12345/approve-remediation \
-H "Content-Type: application/json" \
-d '{
"ticket_id": "INC-12345",
"approve": true,
"approver": "john.doe@company.com",
"comment": "Approved for execution"
}'
Get Analytics
# Reliability statistics
curl "http://localhost:8000/api/v1/stats/reliability?days=30"
# Auto-remediation statistics
curl "http://localhost:8000/api/v1/stats/auto-remediation?days=30"
# Learned patterns (quote the URL so the shell does not interpret & or ?)
curl "http://localhost:8000/api/v1/patterns?category=network&min_occurrences=5"
Configuration
Auto-Remediation Policy
policy = AutoRemediationPolicy(
    name="network-auto-remediation",
    category="network",
    # Thresholds
    min_confidence_score=0.85,       # 85% AI confidence required
    min_reliability_score=80.0,      # 80% reliability required
    min_similar_tickets=5,           # Need 5+ similar resolved tickets
    min_positive_feedback_rate=0.8,  # 80% positive feedback required
    # Allowed actions
    allowed_action_types=["safe_write"],
    allowed_systems=["network"],
    forbidden_commands=["delete", "format", "shutdown"],
    # Time restrictions
    allowed_hours_start=22,          # 10 PM
    allowed_hours_end=6,             # 6 AM
    allowed_days=["monday", "tuesday", "wednesday", "thursday", "friday"],
    # Approval
    requires_approval=True,
    auto_approve_threshold=90.0,     # Auto-approve if reliability ≥ 90%
    approvers=["admin@company.com"],
    # Safety
    max_actions_per_hour=10,
    requires_rollback_plan=True,
    requires_backup=True,
    # Status
    enabled=True
)
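The time restriction in this policy (22 to 6) wraps past midnight, so a plain start <= hour < end comparison would never match. The sketch below shows one way to handle the wrap; the function name is hypothetical, and it assumes the day check applies to the calendar day of the current time, which the policy does not specify.

```python
from datetime import datetime

DAY_NAMES = ("monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday")


def in_allowed_window(now: datetime, start_hour: int, end_hour: int,
                      allowed_days: list) -> bool:
    """Return True if `now` falls inside the policy's maintenance window."""
    if DAY_NAMES[now.weekday()] not in allowed_days:
        return False
    if start_hour <= end_hour:
        # Same-day window, e.g. 9 to 17.
        return start_hour <= now.hour < end_hour
    # Window wraps past midnight, e.g. 22 to 6.
    return now.hour >= start_hour or now.hour < end_hour


weekdays = ["monday", "tuesday", "wednesday", "thursday", "friday"]
print(in_allowed_window(datetime(2024, 1, 1, 23, 0), 22, 6, weekdays))  # Monday 23:00
print(in_allowed_window(datetime(2024, 1, 1, 12, 0), 22, 6, weekdays))  # Monday 12:00
```

Under this interpretation, Saturday 02:00 is rejected even though the window opened on Friday night. If the window should be keyed to the day it opened, the day check would need to look at the previous day for hours before `end_hour`.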
Environment Variables
# Enable/disable auto-remediation globally
AUTO_REMEDIATION_ENABLED=true
# Global safety settings
AUTO_REMEDIATION_MAX_ACTIONS_PER_HOUR=10
AUTO_REMEDIATION_REQUIRE_APPROVAL=true
AUTO_REMEDIATION_MIN_RELIABILITY=85.0
# Pattern learning
PATTERN_MIN_OCCURRENCES=5
PATTERN_MIN_POSITIVE_RATE=0.85
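A service could load these variables into a typed settings object at startup. The following is a sketch, assuming the variable names above; the class and function names are hypothetical, and the fallback defaults mirror the example values except for `enabled`, which falls back to False to stay consistent with the disabled-by-default design.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AutoRemediationSettings:
    enabled: bool
    max_actions_per_hour: int
    require_approval: bool
    min_reliability: float
    pattern_min_occurrences: int
    pattern_min_positive_rate: float


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    return env.get(name, str(default)).strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> AutoRemediationSettings:
    return AutoRemediationSettings(
        enabled=_env_bool(env, "AUTO_REMEDIATION_ENABLED", False),
        max_actions_per_hour=int(env.get("AUTO_REMEDIATION_MAX_ACTIONS_PER_HOUR", "10")),
        require_approval=_env_bool(env, "AUTO_REMEDIATION_REQUIRE_APPROVAL", True),
        min_reliability=float(env.get("AUTO_REMEDIATION_MIN_RELIABILITY", "85.0")),
        pattern_min_occurrences=int(env.get("PATTERN_MIN_OCCURRENCES", "5")),
        pattern_min_positive_rate=float(env.get("PATTERN_MIN_POSITIVE_RATE", "0.85")),
    )


settings = load_settings({"AUTO_REMEDIATION_ENABLED": "true",
                          "AUTO_REMEDIATION_MIN_RELIABILITY": "90"})
print(settings)
```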
Monitoring & Analytics
Key Metrics
# Reliability metrics
- avg_reliability_score: Average across all tickets
- avg_confidence_score: Average AI confidence
- resolution_rate: % of tickets resolved
# Auto-remediation metrics
- execution_rate: % of enabled tickets that were auto-remediated
- success_rate: % of auto-remediation actions that succeeded
- approval_rate: % requiring human approval
# Feedback metrics
- positive_feedback_rate: % positive feedback
- negative_feedback_rate: % negative feedback
- avg_rating: Average star rating (1-5)
# Pattern metrics
- eligible_patterns: # of patterns eligible for auto-remediation
- pattern_success_rate: Success rate across all patterns
Grafana Dashboards
Example metrics:
# Reliability score trend
avg(datacenter_docs_reliability_score) by (category)
# Auto-remediation success rate
rate(datacenter_docs_auto_remediation_success_total[1h]) /
rate(datacenter_docs_auto_remediation_attempts_total[1h])
# Feedback sentiment
sum(datacenter_docs_feedback_total) by (type)
Alerts
# Low reliability alert
- alert: LowReliabilityScore
  expr: avg(datacenter_docs_reliability_score) < 70
  for: 1h
  annotations:
    summary: "Reliability score below threshold"
# High failure rate (failures as a fraction of attempts)
- alert: HighAutoRemediationFailureRate
  expr: rate(datacenter_docs_auto_remediation_failures_total[1h]) / rate(datacenter_docs_auto_remediation_attempts_total[1h]) > 0.2
  for: 15m
  annotations:
    summary: "Auto-remediation failure rate > 20%"
Best Practices
1. Start Conservative
- Enable auto-remediation for low-risk categories first (e.g., cache clearing)
- Set high thresholds initially (reliability ≥ 90%)
- Require approvals for first 20-30 occurrences
- Monitor closely and adjust based on results
2. Gradual Rollout
Week 1-2: Enable for 5% of tickets
Week 3-4: Increase to 20% if success rate > 90%
Week 5-6: Increase to 50% if success rate > 85%
Week 7+: Full rollout with dynamic thresholds
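The document does not say how tickets are selected for each rollout stage. One common technique, offered here as an assumption rather than the system's actual mechanism, is to hash the ticket ID into a stable bucket so that the same ticket always gets the same answer and raising the percentage only ever adds tickets.

```python
import hashlib


def in_rollout(ticket_id: str, percent: float) -> bool:
    """Deterministically place a ticket in the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999
    return bucket < percent * 100


tickets = [f"INC-{i}" for i in range(2000)]
for stage in (5, 20, 50, 100):
    selected = sum(in_rollout(t, stage) for t in tickets)
    print(f"{stage}% stage: {selected} of {len(tickets)} tickets")
```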
3. Category-Specific Policies
Different categories need different thresholds:
| Category | Min Reliability | Auto-Approve | Reason |
|---|---|---|---|
| Cache | 75% | 85% | Low risk, frequent |
| Network | 85% | 90% | Medium risk |
| Storage | 90% | 95% | High risk |
| Security | 95% | Never | Critical, always approve |
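The table can be encoded as a lookup that decides how a ticket in each category is handled. This is a sketch: the thresholds come from the table, but the names, the returned labels, and the treatment of unknown categories are assumptions. A threshold of None represents "Never" auto-approve.

```python
from typing import NamedTuple, Optional


class CategoryThresholds(NamedTuple):
    min_reliability: float
    auto_approve: Optional[float]  # None means never auto-approve


THRESHOLDS = {
    "cache": CategoryThresholds(75.0, 85.0),
    "network": CategoryThresholds(85.0, 90.0),
    "storage": CategoryThresholds(90.0, 95.0),
    "security": CategoryThresholds(95.0, None),
}


def approval_mode(category: str, reliability: float) -> str:
    thresholds = THRESHOLDS.get(category.lower())
    if thresholds is None:
        # Assumption: categories without a policy fall back to manual review.
        return "manual_review"
    if reliability < thresholds.min_reliability:
        return "rejected"
    if thresholds.auto_approve is not None and reliability >= thresholds.auto_approve:
        return "auto_approve"
    return "needs_approval"


for category, score in (("cache", 86), ("network", 87), ("storage", 89), ("security", 99)):
    print(category, score, approval_mode(category, score))
```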
4. Human in the Loop
- Always collect feedback, even for successful auto-remediations
- Review logs weekly
- Adjust thresholds based on feedback trends
- Disable patterns with declining success rates
5. Continuous Learning
- System improves over time through feedback
- Patterns with 20+ occurrences and 90%+ success → Very high confidence
- Allow system to become more autonomous as reliability proves out
- But maintain human oversight for critical operations
Troubleshooting
Auto-remediation not executing
Check:
- Is enable_auto_remediation: true set in the ticket?
- Is there an active policy for the category?
- Does confidence/reliability meet thresholds?
- Are safety checks passing?
- Does pattern meet eligibility requirements?
Debug:
# Check decision
curl http://localhost:8000/api/v1/tickets/TICKET-ID | jq '.remediation_decision'
# Check logs
curl http://localhost:8000/api/v1/tickets/TICKET-ID/remediation-logs
Low reliability scores
Causes:
- Insufficient historical data
- Negative feedback on category
- Low pattern match confidence
- Recent failures in category
Solutions:
- Collect more feedback
- Review and improve resolutions
- Wait for more data points
- Manually resolve similar tickets successfully
Pattern not becoming eligible
Requirements not met:
- Need ≥5 occurrences
- Need ≥85% positive feedback
- Need ≥85% average reliability
Action:
- Continue resolving similar tickets
- Ensure feedback is being collected
- Check pattern stats: GET /api/v1/patterns
Future Enhancements
- Multi-step reasoning: Complex workflows spanning multiple systems
- Predictive remediation: Fix issues before they cause incidents
- A/B testing: Compare different resolution strategies
- Reinforcement learning: Optimize actions based on outcomes
- Natural language explanations: Better transparency in decisions
- Cross-system orchestration: Coordinated actions across infrastructure
Summary
The Auto-Remediation System is designed for safe, gradual automation of infrastructure issue resolution:
- ✅ Disabled by default - explicit opt-in per ticket
- ✅ Multi-factor reliability - comprehensive confidence calculation
- ✅ Human feedback loop - continuous learning and improvement
- ✅ Pattern recognition - learns from similar issues
- ✅ Safety first - extensive checks, approval workflows, rollback
- ✅ Progressive automation - system becomes more autonomous over time
- ✅ Full observability - complete audit trail and analytics
Start small, monitor closely, scale gradually, and let the system learn.
For support: automation-team@company.local