
Technical Requirements for the LLM - Datacenter Documentation Generation

1. Required LLM Capabilities

1.1 Core Capabilities

  • Network Access: SSH, HTTPS, SNMP connections
  • API Interaction: REST, SOAP, GraphQL
  • Code Execution: Python, Bash, PowerShell
  • File Operations: read/write markdown files
  • Database Access: MySQL, PostgreSQL, SQL Server

1.2 Required Python Libraries

# Networking and protocols
pip install paramiko          # SSH connections
pip install pysnmp            # SNMP queries
pip install requests          # HTTP/REST APIs
pip install netmiko           # Network device automation

# Virtualization
pip install pyvmomi           # VMware vSphere API
pip install proxmoxer         # Proxmox API
pip install libvirt-python    # KVM/QEMU

# Storage
pip install purestorage       # Pure Storage REST API client
pip install netapp-ontap      # NetApp API

# Database
pip install mysql-connector-python
pip install psycopg2          # PostgreSQL
pip install pymssql           # Microsoft SQL Server

# Monitoring
pip install zabbix-api        # Zabbix
pip install prometheus-client # Prometheus

# Cloud providers
pip install boto3             # AWS
pip install azure-mgmt-compute     # Azure (install the azure-mgmt-* packages you need)
pip install google-cloud-storage   # GCP (install the google-cloud-* packages you need)

# Utilities
pip install jinja2            # Template rendering
pip install pyyaml            # YAML parsing
pip install pandas            # Data analysis
pip install markdown          # Markdown generation

1.3 CLI Tools Required

# Network tools
apt-get install snmp snmp-mibs-downloader
apt-get install nmap
apt-get install netcat-openbsd

# Virtualization
apt-get install open-vm-tools  # VMware

# Monitoring
apt-get install nagios-plugins

# Storage
apt-get install nfs-common
apt-get install cifs-utils
apt-get install multipath-tools

# Database clients
apt-get install mysql-client
apt-get install postgresql-client

2. Required Access and Credentials

2.1 Credentials Format

Credentials must be provided in a secure file (vault/encrypted):

# credentials.yaml (encrypted)
datacenter:
  
  # Network devices
  network:
    cisco_switches:
      username: admin
      password: ${ENCRYPTED}
      enable_password: ${ENCRYPTED}
    firewalls:
      api_key: ${ENCRYPTED}
      
  # Virtualization
  vmware:
    vcenter_host: vcenter.domain.local
    username: automation@vsphere.local
    password: ${ENCRYPTED}
    
  proxmox:
    host: proxmox.domain.local
    token_name: automation
    token_value: ${ENCRYPTED}
    
  # Storage
  storage_arrays:
    - name: SAN-01
      type: pure_storage
      api_token: ${ENCRYPTED}
      
  # Databases
  databases:
    asset_management:
      host: db.domain.local
      port: 3306
      username: readonly_user
      password: ${ENCRYPTED}
      database: asset_db
      
  # Monitoring
  monitoring:
    zabbix:
      url: https://zabbix.domain.local
      api_token: ${ENCRYPTED}
      
  # Backup
  backup:
    veeam:
      server: veeam.domain.local
      username: automation
      password: ${ENCRYPTED}

2.2 Minimum Required Permissions

IMPORTANT: ALWAYS use least-privilege accounts (read-only wherever possible)

| System          | Account Type | Required Permissions           |
|-----------------|--------------|--------------------------------|
| Network Devices | Read-only    | `show` commands, SNMP read     |
| VMware vCenter  | Read-only    | Global > Read-only role        |
| Storage Arrays  | Read-only    | Monitoring/reporting access    |
| Databases       | SELECT only  | Read access on the asset schema |
| Monitoring      | Read-only    | View dashboards, metrics       |
| Backup Software | Read-only    | View jobs, reports             |
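The read-only requirement can be verified at startup rather than taken on trust. A minimal sketch for MySQL, assuming the caller has already fetched the rows returned by `SHOW GRANTS FOR CURRENT_USER()`; `grants_are_read_only` is a hypothetical helper name, not part of any library:

```python
import re

def grants_are_read_only(show_grants_rows):
    """Return True if every grant row confers at most SELECT/USAGE.

    `show_grants_rows` is the list of strings returned by running
    SHOW GRANTS FOR CURRENT_USER() on the target MySQL server.
    """
    allowed = {'SELECT', 'USAGE'}
    for row in show_grants_rows:
        m = re.match(r'GRANT (.+?) ON ', row, re.IGNORECASE)
        if not m:
            return False  # unparseable row: fail closed
        privileges = {p.strip().upper() for p in m.group(1).split(',')}
        if not privileges <= allowed:
            return False
    return True
```

Running this check before each documentation run and aborting if it fails guards against a credential being silently upgraded to write access.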

3. Network Connectivity

3.1 Network Requirements

The LLM host must be able to reach:

Management Network:
- VLAN 10: 10.0.10.0/24 (Infrastructure Management)
- VLAN 20: 10.0.20.0/24 (Server Management)
- VLAN 30: 10.0.30.0/24 (Storage Management)

Required ports:
- TCP 22   (SSH)
- TCP 443  (HTTPS)
- TCP 3306 (MySQL)
- TCP 5432 (PostgreSQL)
- TCP 1433 (MS SQL Server)
- UDP 161  (SNMP)
- TCP 8006 (Proxmox)
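A minimal pre-flight check for the TCP ports above (a sketch: `check_tcp_port` and the port map are illustrative names; SNMP on UDP 161 needs an actual query, e.g. via pysnmp, since UDP has no handshake to test):

```python
import socket

# TCP ports from the list above; the names are for reporting only
REQUIRED_TCP_PORTS = {'SSH': 22, 'HTTPS': 443, 'MySQL': 3306,
                      'PostgreSQL': 5432, 'MSSQL': 1433, 'Proxmox': 8006}

def check_tcp_port(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this against each management host before a full collection run turns firewall problems into a clear report instead of mid-run failures.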

3.2 Firewall Rules

# Allow LLM host to management networks
Source: [LLM_HOST_IP]
Destination: Management Networks
Protocol: SSH, HTTPS, SNMP, Database ports
Action: ALLOW

# Deny all other traffic from LLM host
Source: [LLM_HOST_IP]
Destination: Production Networks
Action: DENY

4. Rate Limiting and Best Practices

4.1 API Call Limits

# Respect vendor rate limits
RATE_LIMITS = {
    'vmware_vcenter': {'calls_per_minute': 100},
    'network_devices': {'calls_per_minute': 10},
    'storage_api': {'calls_per_minute': 60},
    'monitoring_api': {'calls_per_minute': 300}
}
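A minimal client-side throttle that enforces a calls_per_minute budget (a sketch; `RateLimiter` is an illustrative name, and some vendor SDKs already throttle for you):

```python
import time

class RateLimiter:
    """Simple fixed-interval throttle derived from a calls-per-minute budget."""
    def __init__(self, calls_per_minute):
        self.min_interval = 60.0 / calls_per_minute
        self.last_call = 0.0

    def wait(self):
        # Sleep just long enough to keep the average rate under the limit
        now = time.monotonic()
        remaining = self.min_interval - (now - self.last_call)
        if remaining > 0:
            time.sleep(remaining)
        self.last_call = time.monotonic()

# Example: network devices allow 10 calls/minute in the table above
limiter = RateLimiter(10)
```

Call `limiter.wait()` before each API request; combined with the retry decorator below, this keeps bursts under the vendor ceiling.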

# Implement retry logic with exponential backoff
import time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
        return wrapper
    return decorator

4.2 Concurrent Operations

# Limit concurrent operations
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # Do not saturate the target systems

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(query_device, device) for device in devices]
    results = [f.result() for f in futures]

5. Error Handling e Logging

5.1 Logging Configuration

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('/var/log/datacenter-docs/generation.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger('datacenter-docs')

5.2 Error Handling Strategy

class DataCollectionError(Exception):
    """Custom exception for data-collection errors"""
    pass

class AuthenticationError(Exception):
    """Raised when a system rejects the automation credentials"""
    pass

try:
    data = collect_vmware_data()
except ConnectionError as e:
    logger.error(f"Cannot connect to vCenter: {e}")
    # Fall back to cached data if available
    data = load_cached_data('vmware')
except AuthenticationError as e:
    logger.critical(f"Authentication failed: {e}")
    # Alert the team
    send_alert("VMware auth failed")
except Exception as e:
    logger.exception(f"Unexpected error: {e}")
    # Continue with partial data
    data = get_partial_data()

6. Caching e Performance

6.1 Cache Strategy

import json

import redis

# Set up Redis for caching
cache = redis.Redis(host='localhost', port=6379, db=0)

def get_cached_or_fetch(key, fetch_function, ttl=3600):
    """Get from cache or fetch if not available"""
    cached = cache.get(key)
    if cached:
        logger.info(f"Cache hit for {key}")
        return json.loads(cached)
    
    logger.info(f"Cache miss for {key}, fetching...")
    data = fetch_function()
    cache.setex(key, ttl, json.dumps(data))
    return data

# Example usage
vmware_inventory = get_cached_or_fetch(
    'vmware_inventory',
    lambda: collect_vmware_inventory(),
    ttl=3600  # 1 hour
)

6.2 Data to Cache

  • 1 hour: performance metrics, real-time status
  • 6 hours: inventory, configurations
  • 24 hours: asset database, ownership info
  • 7 days: historical trends, capacity planning
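The tiers above can be encoded as constants and passed as the ttl argument of get_cached_or_fetch (the names here are illustrative, not part of any API):

```python
# Cache TTLs in seconds, matching the tiers above
CACHE_TTLS = {
    'metrics':   1 * 3600,   # performance metrics, real-time status
    'inventory': 6 * 3600,   # inventory, configurations
    'assets':   24 * 3600,   # asset database, ownership info
    'trends':    7 * 86400,  # historical trends, capacity planning
}

def ttl_for(data_class):
    """Look up the TTL for a data class, defaulting to the shortest tier."""
    return CACHE_TTLS.get(data_class, CACHE_TTLS['metrics'])
```

Defaulting unknown data classes to the shortest TTL errs on the side of freshness.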

7. Execution Schedule

7.1 Recommended Cron Schedule

# Full documentation refresh - every 6 hours
0 */6 * * * /usr/local/bin/generate-datacenter-docs.sh --full

# Quick update (metrics only) - every hour
0 * * * * /usr/local/bin/generate-datacenter-docs.sh --metrics-only

# Weekly comprehensive report - Sunday night
0 2 * * 0 /usr/local/bin/generate-datacenter-docs.sh --full --detailed

7.2 Example Wrapper Script

#!/bin/bash
# generate-datacenter-docs.sh

set -euo pipefail

LOGFILE="/var/log/datacenter-docs/$(date +%Y%m%d_%H%M%S).log"
LOCKFILE="/var/run/datacenter-docs.lock"

# Prevent concurrent executions (note: check-then-touch is not atomic;
# consider flock(1) for a race-free lock)
if [ -f "$LOCKFILE" ]; then
    echo "Another instance is running. Exiting."
    exit 1
fi

touch "$LOCKFILE"
trap 'rm -f "$LOCKFILE"' EXIT

# Activate virtual environment
source /opt/datacenter-docs/venv/bin/activate

# Run Python script with parameters
python3 /opt/datacenter-docs/main.py "$@" 2>&1 | tee -a "$LOGFILE"

# Cleanup old logs (keep 30 days)
find /var/log/datacenter-docs/ -name "*.log" -mtime +30 -delete

8. Output e Validazione

8.1 Post-Generation Checks

import os
import re

def validate_documentation(section_file):
    """Validate the generated document"""
    
    checks = {
        'file_exists': os.path.exists(section_file),
        'not_empty': os.path.exists(section_file) and os.path.getsize(section_file) > 0,
        'valid_markdown': validate_markdown_syntax(section_file),
        'no_placeholders': not contains_placeholders(section_file),
        'token_limit': count_tokens(section_file) < 50000
    }
    
    if all(checks.values()):
        logger.info(f"✓ {section_file} validation passed")
        return True
    else:
        failed = [k for k, v in checks.items() if not v]
        logger.error(f"✗ {section_file} validation failed: {failed}")
        return False

def contains_placeholders(file_path):
    """Check for unsubstituted placeholders"""
    with open(file_path, 'r') as f:
        content = f.read()
    patterns = [r'\[.*?\]', r'\{.*?\}', r'TODO', r'FIXME']
    return any(re.search(p, content) for p in patterns)
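validate_markdown_syntax and count_tokens are referenced above but not defined in this document. A naive stdlib-only sketch, assuming roughly 4 characters per token; a real deployment would likely use a proper tokenizer (e.g. tiktoken) and a markdown parser instead:

```python
def count_tokens(file_path):
    """Rough token estimate: ~4 characters per token (an approximation)."""
    with open(file_path, 'r') as f:
        return len(f.read()) // 4

def validate_markdown_syntax(file_path):
    """Minimal structural check: code fences must be balanced."""
    with open(file_path, 'r') as f:
        lines = f.read().splitlines()
    fences = sum(1 for line in lines if line.lstrip().startswith('```'))
    return fences % 2 == 0
```

These are deliberately conservative placeholders so that validate_documentation runs end to end; swap in stricter implementations as needed.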

8.2 Notification System

def send_completion_notification(success, sections_updated, errors):
    """Send a notification when generation completes"""
    
    # Build the error block outside the f-string
    # (backslashes in f-string expressions are a SyntaxError before Python 3.12)
    error_block = 'Errors:\n' + '\n'.join(errors) if errors else ''
    
    message = f"""
    Datacenter Documentation Update
    
    Status: {'✓ SUCCESS' if success else '✗ FAILED'}
    Sections Updated: {', '.join(sections_updated)}
    Errors: {len(errors)}
    
    {error_block}
    
    Timestamp: {datetime.now().isoformat()}
    """
    
    # Send via multiple channels
    send_email(recipients=['ops-team@company.com'], subject='Doc Update', body=message)
    send_slack(channel='#datacenter-ops', message=message)
    # send_teams / send_webhook as needed

9. Security Considerations

9.1 Secrets Management

# NEVER store credentials in plain text
# Always use a vault

from cryptography.fernet import Fernet
import keyring

def get_credential(service, account):
    """Retrieve credential from OS keyring"""
    return keyring.get_password(service, account)

# Alternatively, HashiCorp Vault
import hvac

client = hvac.Client(url='https://vault.company.com')
client.auth.approle.login(role_id=ROLE_ID, secret_id=SECRET_ID)
credentials = client.secrets.kv.v2.read_secret_version(path='datacenter/creds')

9.2 Audit Trail

# Log ALL operations for auditing
audit_log = {
    'timestamp': datetime.now().isoformat(),
    'user': 'automation-account',
    'action': 'documentation_generation',
    'sections': sections_updated,
    'systems_accessed': list_of_systems,
    'duration': elapsed_time,
    'success': success  # bool: True or False
}

write_audit_log(audit_log)
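write_audit_log is not defined in this document. A minimal append-only JSON Lines sketch (the log path is an assumption, and `write_audit_log` here is just one possible implementation):

```python
import json

AUDIT_LOG_PATH = '/var/log/datacenter-docs/audit.jsonl'  # assumed path

def write_audit_log(entry, path=AUDIT_LOG_PATH):
    """Append one audit record as a JSON line (append-only aids tamper evidence)."""
    with open(path, 'a') as f:
        f.write(json.dumps(entry, default=str) + '\n')
```

One record per line keeps the log grep-friendly and trivially parseable for later review.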

10. Troubleshooting

10.1 Common Issues

| Problem                | Likely Cause               | Solution                                 |
|------------------------|----------------------------|------------------------------------------|
| Connection Timeout     | Firewall/network           | Verify connectivity and firewall rules   |
| Authentication Failed  | Wrong/expired credentials  | Rotate credentials, check the vault      |
| API Rate Limit         | Too many requests          | Implement backoff, reduce frequency      |
| Incomplete Data        | Source temporarily down    | Use cached data, generate a partial doc  |
| Token Limit Exceeded   | Too much data in a section | Drop historical data, optimize the format |

10.2 Debug Mode

# Enable debug mode for troubleshooting
DEBUG = os.getenv('DEBUG', 'False').lower() == 'true'

if DEBUG:
    logging.getLogger().setLevel(logging.DEBUG)
    # Save raw responses for later analysis
    with open(f'debug_{timestamp}.json', 'w') as f:
        json.dump(raw_response, f, indent=2)

11. Testing

11.1 Unit Tests

import unittest

class TestDataCollection(unittest.TestCase):
    def test_vmware_connection(self):
"""Test the vCenter connection"""
        result = test_vmware_connection()
        self.assertTrue(result.success)
        
    def test_data_validation(self):
"""Test validation of collected data"""
        sample_data = load_sample_data()
        self.assertTrue(validate_data_structure(sample_data))

11.2 Integration Tests

# End-to-end tests in the test environment
./run-tests.sh --integration --environment=test

# Verify that all systems are reachable
./check-connectivity.sh

# Dry run without saving
python3 main.py --dry-run --verbose

Pre-Deployment Checklist

Before putting the system into production:

  • All libraries installed
  • Credentials configured in a secure vault
  • Connectivity verified to all systems
  • Automation account permissions validated (read-only)
  • Firewall rules approved and configured
  • Logging configured and tested
  • Notification system tested
  • Cron jobs configured
  • Existing documentation backed up
  • Operational runbook completed
  • Escalation path defined
  • DR procedures documented

Document Version: 1.0
Last Updated: 2025-01-XX
Owner: Automation Team