# Technical Requirements for the LLM - Datacenter Documentation Generation

## 1. Required LLM Capabilities

### 1.1 Core Capabilities

- **Network Access**: SSH, HTTPS, and SNMP connections
- **API Interaction**: REST, SOAP, GraphQL
- **Code Execution**: Python, Bash, PowerShell
- **File Operations**: Reading/writing Markdown files
- **Database Access**: MySQL, PostgreSQL, SQL Server

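Before a collection run, the presence of the required Python libraries can be smoke-tested programmatically. A minimal sketch using only the standard library; the module list is an illustrative subset of the packages listed in this document:

```python
import importlib.util

def missing_modules(required):
    """Return the subset of module names that cannot be imported."""
    return [name for name in required if importlib.util.find_spec(name) is None]

# Illustrative subset of the required libraries (import names, not pip names)
REQUIRED = ["paramiko", "requests", "yaml", "jinja2"]

if __name__ == "__main__":
    gaps = missing_modules(REQUIRED)
    if gaps:
        print(f"Missing modules: {', '.join(gaps)}")
    else:
        print("All required modules available")
```

Running this at startup fails fast with an actionable message instead of crashing mid-collection on an `ImportError`.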
### 1.2 Required Python Libraries

```bash
# Networking and protocols
pip install paramiko          # SSH connections
pip install pysnmp            # SNMP queries
pip install requests          # HTTP/REST APIs
pip install netmiko           # Network device automation

# Virtualization
pip install pyvmomi           # VMware vSphere API
pip install proxmoxer         # Proxmox API
pip install libvirt-python    # KVM/QEMU

# Storage
pip install purestorage       # Pure Storage API
pip install netapp-ontap      # NetApp API

# Databases
pip install mysql-connector-python
pip install psycopg2-binary   # PostgreSQL
pip install pymssql           # Microsoft SQL Server

# Monitoring
pip install zabbix-api        # Zabbix
pip install prometheus-client # Prometheus

# Cloud providers
pip install boto3                 # AWS
pip install azure-mgmt-resource   # Azure (install the azure-mgmt-* packages you need)
pip install google-cloud-storage  # GCP (install the google-cloud-* packages you need)

# Utilities
pip install jinja2            # Template rendering
pip install pyyaml            # YAML parsing
pip install pandas            # Data analysis
pip install markdown          # Markdown generation
```

### 1.3 CLI Tools Required

```bash
# Network tools
apt-get install snmp snmp-mibs-downloader
apt-get install nmap
apt-get install netcat-openbsd

# Virtualization
apt-get install open-vm-tools  # VMware

# Monitoring
apt-get install nagios-plugins

# Storage
apt-get install nfs-common
apt-get install cifs-utils
apt-get install multipath-tools

# Database clients
apt-get install mysql-client
apt-get install postgresql-client
```

---

## 2. Required Access and Credentials

### 2.1 Credential Format

Credentials must be supplied in a secure (vault/encrypted) file:

```yaml
# credentials.yaml (encrypted)
datacenter:
  # Network devices
  network:
    cisco_switches:
      username: admin
      password: ${ENCRYPTED}
      enable_password: ${ENCRYPTED}
    firewalls:
      api_key: ${ENCRYPTED}

  # Virtualization
  vmware:
    vcenter_host: vcenter.domain.local
    username: automation@vsphere.local
    password: ${ENCRYPTED}

  proxmox:
    host: proxmox.domain.local
    token_name: automation
    token_value: ${ENCRYPTED}

  # Storage
  storage_arrays:
    - name: SAN-01
      type: pure_storage
      api_token: ${ENCRYPTED}

  # Databases
  databases:
    asset_management:
      host: db.domain.local
      port: 3306
      username: readonly_user
      password: ${ENCRYPTED}
      database: asset_db

  # Monitoring
  monitoring:
    zabbix:
      url: https://zabbix.domain.local
      api_token: ${ENCRYPTED}

  # Backup
  backup:
    veeam:
      server: veeam.domain.local
      username: automation
      password: ${ENCRYPTED}
```

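The `${ENCRYPTED}` placeholders above must be resolved at load time from the vault or keyring rather than ever being written to disk in clear text. A minimal sketch of the substitution step, assuming a `lookup(path)` callable that fetches the decrypted secret (the helper name and path scheme are illustrative):

```python
def resolve_secrets(node, lookup, path=()):
    """Recursively replace ${ENCRYPTED} placeholders in a parsed credentials tree."""
    if isinstance(node, dict):
        return {k: resolve_secrets(v, lookup, path + (k,)) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_secrets(v, lookup, path + (str(i),)) for i, v in enumerate(node)]
    if node == "${ENCRYPTED}":
        # e.g. "datacenter/network/cisco_switches/password" as the vault key
        return lookup("/".join(path))
    return node

# Usage sketch: creds = resolve_secrets(yaml.safe_load(raw), vault_client.read_secret)
```

Because the resolver walks the tree generically, new credential entries in the YAML need no code changes.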
### 2.2 Minimum Required Permissions

**IMPORTANT**: ALWAYS use least-privilege accounts (read-only wherever possible).

| System | Account Type | Required Permissions |
|--------|--------------|----------------------|
| Network Devices | Read-only | show commands, SNMP read |
| VMware vCenter | Read-only | Global > Read-only role |
| Storage Arrays | Read-only | Monitoring/reporting access |
| Databases | SELECT only | Read access on the asset schema |
| Monitoring | Read-only | View dashboards, metrics |
| Backup Software | Read-only | View jobs, reports |

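Whether a database account really is SELECT-only can be sanity-checked from its grant statements (e.g. the output of MySQL's `SHOW GRANTS`). A minimal sketch of the string-level check; the privilege list and sample grants are illustrative, not exhaustive:

```python
# Privileges that allow modifying data or schema (illustrative subset)
WRITE_PRIVILEGES = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE", "ALL PRIVILEGES"}

def has_write_privileges(grant_lines):
    """Return True if any grant statement includes a write-capable privilege."""
    for line in grant_lines:
        upper = line.upper()
        if any(priv in upper for priv in WRITE_PRIVILEGES):
            return True
    return False

# Example: a healthy read-only automation account
grants = ["GRANT SELECT ON asset_db.* TO 'readonly_user'@'%'"]
assert not has_write_privileges(grants)
```

Running this check at deployment time catches an over-privileged automation account before it ever touches production.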
---

## 3. Network Connectivity

### 3.1 Network Requirements

```
The LLM host must be able to reach:

Management Network:
- VLAN 10: 10.0.10.0/24 (Infrastructure Management)
- VLAN 20: 10.0.20.0/24 (Server Management)
- VLAN 30: 10.0.30.0/24 (Storage Management)

Required ports:
- TCP 22 (SSH)
- TCP 443 (HTTPS)
- TCP 3306 (MySQL)
- TCP 5432 (PostgreSQL)
- TCP 1433 (MS SQL Server)
- UDP 161 (SNMP)
- TCP 8006 (Proxmox)
```

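TCP reachability for the ports above can be pre-checked from the LLM host before a full collection run. A minimal sketch using only the standard library (the hostnames and timeout are illustrative):

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage sketch: verify management-network endpoints before collection
# for host, port in [("vcenter.domain.local", 443), ("proxmox.domain.local", 8006)]:
#     print(host, port, "open" if tcp_port_open(host, port) else "CLOSED")
```

Note that SNMP runs over UDP, so it needs a protocol-level probe (e.g. an SNMP GET via pysnmp) rather than a TCP connect.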
### 3.2 Firewall Rules

```
# Allow LLM host to management networks
Source: [LLM_HOST_IP]
Destination: Management Networks
Protocol: SSH, HTTPS, SNMP, Database ports
Action: ALLOW

# Deny all other traffic from LLM host
Source: [LLM_HOST_IP]
Destination: Production Networks
Action: DENY
```

---

## 4. Rate Limiting and Best Practices

### 4.1 API Call Limits

```python
# Respect vendor rate limits
RATE_LIMITS = {
    'vmware_vcenter': {'calls_per_minute': 100},
    'network_devices': {'calls_per_minute': 10},
    'storage_api': {'calls_per_minute': 60},
    'monitoring_api': {'calls_per_minute': 300}
}

# Implement retry logic with exponential backoff
import time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
        return wrapper
    return decorator
```

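The `RATE_LIMITS` table above can be enforced with a simple minimum-interval limiter that spaces successive calls per API; a token bucket would also work for bursty workloads. A minimal sketch (the class name is illustrative):

```python
import time

class MinIntervalLimiter:
    """Spaces calls so each API stays under its calls_per_minute budget."""

    def __init__(self, rate_limits):
        # Seconds that must elapse between consecutive calls to the same API
        self._interval = {api: 60.0 / cfg['calls_per_minute']
                          for api, cfg in rate_limits.items()}
        self._last_call = {}

    def wait(self, api):
        """Block until the next call to `api` is allowed, then record it."""
        now = time.monotonic()
        earliest = self._last_call.get(api, 0.0) + self._interval[api]
        if now < earliest:
            time.sleep(earliest - now)
        self._last_call[api] = time.monotonic()

# Usage sketch:
# limiter = MinIntervalLimiter(RATE_LIMITS)
# limiter.wait('network_devices')  # 10 calls/minute -> calls spaced >= 6 s apart
# device.send_command(...)
```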
### 4.2 Concurrent Operations

```python
# Limit concurrent operations
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # Do not saturate the target systems

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(query_device, device) for device in devices]
    results = [f.result() for f in futures]
```

---

## 5. Error Handling and Logging

### 5.1 Logging Configuration

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('/var/log/datacenter-docs/generation.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger('datacenter-docs')
```

### 5.2 Error Handling Strategy

```python
class DataCollectionError(Exception):
    """Custom exception for data-collection errors"""
    pass

class AuthenticationError(Exception):
    """Raised when a source rejects our credentials"""
    pass

try:
    data = collect_vmware_data()
except ConnectionError as e:
    logger.error(f"Cannot connect to vCenter: {e}")
    # Fall back to cached data if available
    data = load_cached_data('vmware')
except AuthenticationError as e:
    logger.critical(f"Authentication failed: {e}")
    # Alert the team
    send_alert("VMware auth failed")
except Exception as e:
    logger.exception(f"Unexpected error: {e}")
    # Continue with partial data
    data = get_partial_data()
```

---

## 6. Caching and Performance

### 6.1 Cache Strategy

```python
import json

import redis

# Set up Redis for caching
cache = redis.Redis(host='localhost', port=6379, db=0)

def get_cached_or_fetch(key, fetch_function, ttl=3600):
    """Get from cache or fetch if not available"""
    cached = cache.get(key)
    if cached:
        logger.info(f"Cache hit for {key}")
        return json.loads(cached)

    logger.info(f"Cache miss for {key}, fetching...")
    data = fetch_function()
    cache.setex(key, ttl, json.dumps(data))
    return data

# Example usage
vmware_inventory = get_cached_or_fetch(
    'vmware_inventory',
    lambda: collect_vmware_inventory(),
    ttl=3600  # 1 hour
)
```

### 6.2 Data to Cache

- **1 hour**: Performance metrics, real-time status
- **6 hours**: Inventory, configurations
- **24 hours**: Asset database, ownership info
- **7 days**: Historical trends, capacity planning

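These retention classes map naturally onto `ttl` values for the `get_cached_or_fetch` helper in section 6.1. A minimal sketch; the class names are illustrative:

```python
# Seconds of freshness per data class (mirrors the list above)
CACHE_TTLS = {
    'metrics': 1 * 3600,      # 1 hour: performance metrics, real-time status
    'inventory': 6 * 3600,    # 6 hours: inventory, configurations
    'assets': 24 * 3600,      # 24 hours: asset database, ownership info
    'trends': 7 * 24 * 3600,  # 7 days: historical trends, capacity planning
}

def ttl_for(data_class):
    """Look up the TTL in seconds, defaulting to the shortest (most conservative)."""
    return CACHE_TTLS.get(data_class, CACHE_TTLS['metrics'])

# Usage sketch:
# get_cached_or_fetch('vmware_inventory', fetch_fn, ttl=ttl_for('inventory'))
```

Defaulting unknown classes to the shortest TTL errs on the side of fresher data at the cost of extra fetches.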
---

## 7. Execution Schedule

### 7.1 Recommended Cron Schedule

```cron
# Full documentation refresh - every 6 hours
0 */6 * * * /usr/local/bin/generate-datacenter-docs.sh --full

# Quick update (metrics only) - hourly
0 * * * * /usr/local/bin/generate-datacenter-docs.sh --metrics-only

# Weekly comprehensive report - Sunday night
0 2 * * 0 /usr/local/bin/generate-datacenter-docs.sh --full --detailed
```

### 7.2 Example Wrapper Script

```bash
#!/bin/bash
# generate-datacenter-docs.sh

set -e

LOGFILE="/var/log/datacenter-docs/$(date +%Y%m%d_%H%M%S).log"
LOCKFILE="/var/run/datacenter-docs.lock"

# Prevent concurrent executions
if [ -f "$LOCKFILE" ]; then
    echo "Another instance is running. Exiting."
    exit 1
fi

touch "$LOCKFILE"
trap "rm -f $LOCKFILE" EXIT

# Activate virtual environment
source /opt/datacenter-docs/venv/bin/activate

# Run Python script with parameters
python3 /opt/datacenter-docs/main.py "$@" 2>&1 | tee -a "$LOGFILE"

# Cleanup old logs (keep 30 days)
find /var/log/datacenter-docs/ -name "*.log" -mtime +30 -delete
```

---

## 8. Output and Validation

### 8.1 Post-Generation Checks

```python
import os
import re

def validate_documentation(section_file):
    """Validate the generated document"""

    checks = {
        'file_exists': os.path.exists(section_file),
        'not_empty': os.path.exists(section_file) and os.path.getsize(section_file) > 0,
        'valid_markdown': validate_markdown_syntax(section_file),
        'no_placeholders': not contains_placeholders(section_file),
        'token_limit': count_tokens(section_file) < 50000
    }

    if all(checks.values()):
        logger.info(f"✓ {section_file} validation passed")
        return True
    else:
        failed = [k for k, v in checks.items() if not v]
        logger.error(f"✗ {section_file} validation failed: {failed}")
        return False

def contains_placeholders(file_path):
    """Check for unsubstituted placeholders"""
    with open(file_path, 'r') as f:
        content = f.read()
    # Note: the bracket patterns also match ordinary Markdown links;
    # tighten them to your template's actual placeholder syntax.
    patterns = [r'\[.*?\]', r'\{.*?\}', r'TODO', r'FIXME']
    return any(re.search(p, content) for p in patterns)
```

### 8.2 Notification System

```python
from datetime import datetime

def send_completion_notification(success, sections_updated, errors):
    """Send a notification when generation finishes"""

    error_text = 'Errors:\n' + '\n'.join(errors) if errors else ''
    message = f"""
Datacenter Documentation Update

Status: {'✓ SUCCESS' if success else '✗ FAILED'}
Sections Updated: {', '.join(sections_updated)}
Errors: {len(errors)}

{error_text}

Timestamp: {datetime.now().isoformat()}
"""

    # Send via multiple channels
    send_email(recipients=['ops-team@company.com'], subject='Doc Update', body=message)
    send_slack(channel='#datacenter-ops', message=message)
    # send_teams / send_webhook as needed
```

---

## 9. Security Considerations

### 9.1 Secrets Management

```python
# NEVER store credentials in plain text
# Always use a vault or the OS keyring

import keyring

def get_credential(service, account):
    """Retrieve credential from OS keyring"""
    return keyring.get_password(service, account)

# Or HashiCorp Vault
import hvac

client = hvac.Client(url='https://vault.company.com')
client.auth.approle.login(role_id=ROLE_ID, secret_id=SECRET_ID)  # ROLE_ID/SECRET_ID from the environment
credentials = client.secrets.kv.v2.read_secret_version(path='datacenter/creds')
```

### 9.2 Audit Trail

```python
# Log ALL operations for auditing
audit_log = {
    'timestamp': datetime.now().isoformat(),
    'user': 'automation-account',
    'action': 'documentation_generation',
    'sections': sections_updated,
    'systems_accessed': list_of_systems,
    'duration': elapsed_time,
    'success': success  # bool outcome of the run
}

write_audit_log(audit_log)
```

---

## 10. Troubleshooting

### 10.1 Common Issues

| Problem | Probable Cause | Solution |
|---------|----------------|----------|
| Connection Timeout | Firewall/network | Verify connectivity and firewall rules |
| Authentication Failed | Wrong/expired credentials | Rotate credentials, check the vault |
| API Rate Limit | Too many requests | Implement backoff, reduce frequency |
| Incomplete Data | Source temporarily down | Use cached data, generate a partial doc |
| Token Limit Exceeded | Too much data in a section | Trim historical data, optimize the format |

### 10.2 Debug Mode

```python
# Enable debug mode for troubleshooting
DEBUG = os.getenv('DEBUG', 'False').lower() == 'true'

if DEBUG:
    logging.getLogger().setLevel(logging.DEBUG)
    # Save raw responses for later analysis
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    with open(f'debug_{timestamp}.json', 'w') as f:
        json.dump(raw_response, f, indent=2)
```

---

## 11. Testing

### 11.1 Unit Tests

```python
import unittest

class TestDataCollection(unittest.TestCase):
    def test_vmware_connection(self):
        """Test the vCenter connection"""
        result = test_vmware_connection()
        self.assertTrue(result.success)

    def test_data_validation(self):
        """Test validation of collected data"""
        sample_data = load_sample_data()
        self.assertTrue(validate_data_structure(sample_data))
```

### 11.2 Integration Tests

```bash
# End-to-end test in the test environment
./run-tests.sh --integration --environment=test

# Verify that all systems are reachable
./check-connectivity.sh

# Dry run without saving
python3 main.py --dry-run --verbose
```

---

## Pre-Deployment Checklist

Before putting the system into production:

- [ ] All libraries installed
- [ ] Credentials configured in a secure vault
- [ ] Connectivity verified to all systems
- [ ] Automation account permissions validated (read-only)
- [ ] Firewall rules approved and configured
- [ ] Logging configured and tested
- [ ] Notification system tested
- [ ] Cron jobs configured
- [ ] Existing documentation backed up
- [ ] Operational runbook completed
- [ ] Escalation path defined
- [ ] DR procedures documented

---

**Document Version**: 1.0

**Last Updated**: 2025-01-XX

**Owner**: Automation Team