# Troubleshooting Guide
Common issues and solutions for API Gateway deployment.
## Gateway Issues
### Gateway Pods Not Starting
**Symptoms**:
- Gateway pods in `CrashLoopBackOff` or `Error` state
- Pods continuously restarting
**Diagnosis**:
```bash
# Check pod status
kubectl get pods -n <namespace> -l app.kubernetes.io/name=gateway
# View pod logs
kubectl logs -n <namespace> <gateway-pod-name>
# Describe pod for events
kubectl describe pod -n <namespace> <gateway-pod-name>
```
**Common Causes & Solutions**:
**1. Configuration Store Connection Failure**
```bash
# Check Data Plane Manager is running
kubectl get pods -n <namespace> -l app=dp-manager
# Verify configuration endpoint
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml | grep endpoint
# Expected: Configuration store endpoint URL
```
**2. TLS Certificate Issues**
```bash
# Verify TLS secret exists
kubectl get secret <gateway-tls-secret> -n <namespace>
# Check certificate validity
kubectl get secret <gateway-tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
```
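If the certificate is expired or otherwise invalid, one way to fix it (a sketch; secret, file, and deployment names are placeholders) is to recreate the secret from fresh PEM files and restart the gateway:
```bash
# Recreate the TLS secret from new certificate files
kubectl delete secret <gateway-tls-secret> -n <namespace>
kubectl create secret tls <gateway-tls-secret> \
  --cert=./tls.crt \
  --key=./tls.key \
  -n <namespace>
# Restart the gateway so it picks up the new certificate
kubectl rollout restart deployment/<gateway-deployment> -n <namespace>
```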
**3. Configuration Error**
```bash
# Check gateway ConfigMap
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml
# Validate YAML syntax
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml | yq eval '.'
```
### Gateway Returns 404 for All Routes
**Symptoms**:
- All requests return HTTP 404
- Routes configured but not working
**Diagnosis**:
```bash
# Check if routes are published in Dashboard
# Navigate: Services → <service> → Routes
# Verify: Route status shows "Published"
# Check gateway logs for route loading
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --tail=100 | grep -i route
```
**Solutions**:
**1. Routes Not Published**
- Routes synced via CLI are NOT active until published
- Publish each route in Dashboard:
  - Services → Select service → Routes tab
  - Click "Publish" for each route
  - Select appropriate gateway group
**2. Wrong Gateway Group**
```bash
# Verify gateway group
kubectl get deployment <gateway-deployment> -n <namespace> -o yaml | grep GATEWAY_GROUP
# Expected: Configured group name
```
**3. Host Header Mismatch**
```bash
# Test with correct Host header
curl -H "Host: app.domain.com" http://<gateway-ip>/
# Check route configuration
<cli-tool> dump --server <dashboard-url> --token <TOKEN>
```
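To confirm the mismatch quickly, compare the status codes returned with and without the expected Host header (a sketch; the hostname is an example):
```bash
# A 404 only on the bare-IP request points to a host-based route
curl -s -o /dev/null -w "with Host header:    %{http_code}\n" \
  -H "Host: app.domain.com" http://<gateway-ip>/
curl -s -o /dev/null -w "without Host header: %{http_code}\n" \
  http://<gateway-ip>/
```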
### Gateway Service Unavailable (503)
**Symptoms**:
- Requests return HTTP 503
- Gateway is running but can't reach backends
**Diagnosis**:
```bash
# Check backend service exists
kubectl get svc -n <namespace> <backend-service-name>
# Check backend endpoints
kubectl get endpoints -n <namespace> <backend-service-name>
# Check gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --tail=100 | grep -i "503\|upstream"
```
**Solutions**:
**1. Backend Service Not Found**
```bash
# Verify service exists
kubectl get svc -n <namespace>
# Check service name in route config matches
<cli-tool> dump --server <URL> --token <TOKEN> | grep -A 5 upstream
```
**2. No Healthy Endpoints**
```bash
# Check if pods are running
kubectl get pods -n <namespace> -l app=<backend-app>
# Verify endpoints exist
kubectl get endpoints -n <namespace> <service-name>
# If empty, check service selector
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A 3 selector
kubectl get pods -n <namespace> --show-labels | grep <label>
```
**3. Service Discovery Not Working**
- See [Service Discovery Issues](#service-discovery-issues)
## Service Discovery Issues
### Service Registry Not Found
**Error**: `service registry not found` or `discovery failed`
**Diagnosis**:
```bash
# Check service registry in Dashboard
# Navigate: Settings → Service Registry
# Verify: Status shows "Connected" or "Healthy"
# Check gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway | grep -i discovery
```
**Solutions**:
**1. Service Registry Not Configured**
- Dashboard → Settings → Service Registry → Add Service Registry
- Type: Kubernetes
- Configure:
```
Name: kubernetes-cluster
API Server: https://kubernetes.default.svc.cluster.local:443
Token Path: /var/run/secrets/kubernetes.io/serviceaccount/token
```
**2. RBAC Permissions Missing**
```bash
# Check permissions
kubectl auth can-i list endpoints --as=system:serviceaccount:<namespace>:default
# If "no", create ClusterRoleBinding
kubectl create clusterrolebinding gateway-discovery \
  --clusterrole=view \
  --serviceaccount=<namespace>:default
```
**3. Service Port Not Named**
```bash
# Check service definition
kubectl get svc <service-name> -n <namespace> -o yaml
```
The port MUST have a name:
```yaml
ports:
  - port: 80
    targetPort: 8000
    name: http   # ← Required for service discovery
```
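If the service was created without a port name, a JSON patch can add one without re-applying the full manifest (a sketch; index 0 assumes a single-port service):
```bash
# Add the missing port name in place
kubectl patch svc <service-name> -n <namespace> --type=json \
  -p='[{"op":"add","path":"/spec/ports/0/name","value":"http"}]'
```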
### Endpoints Not Discovered
**Symptoms**:
- Service discovery configured but endpoints not updating
- Scaling pods doesn't update gateway
**Diagnosis**:
```bash
# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>
# Scale pods and verify endpoints update
kubectl scale deployment <name> -n <namespace> --replicas=5
kubectl get endpoints -n <namespace> <service-name>
# Check gateway discovers endpoints
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway | grep -i endpoint
```
**Solutions**:
**1. Check Service Registry Connection**
```bash
# In Dashboard, verify registry status
# Settings → Service Registry → kubernetes-cluster
# Status should be "Connected"
```
**2. Verify Service Name Format**
```yaml
# Format: <namespace>/<service-name>:<port-name>
upstream:
  discovery_type: kubernetes
  service_name: <namespace>/web-service:http
# NOT: web-service or web-service.<namespace>.svc.cluster.local
```
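To double-check the `<port-name>` part of the identifier, list the named ports on the target service (web-service continues the example above):
```bash
# Show each port name and number on the upstream service
kubectl get svc web-service -n <namespace> \
  -o jsonpath='{range .spec.ports[*]}{.name}{" -> "}{.port}{"\n"}{end}'
```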
**3. Restart Gateway Pods**
```bash
kubectl rollout restart deployment/<gateway-deployment> -n <namespace>
```
## Ingress & Certificate Issues
### Certificate Not Trusted / Invalid
**Symptoms**:
- Browser shows "Not Secure" warning
- Certificate errors in logs
**Diagnosis**:
```bash
# Check certificate
kubectl get certificate -n <namespace>
# Describe certificate for errors
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=50
# Test certificate
openssl s_client -connect demo.domain.com:443 -servername demo.domain.com < /dev/null 2>/dev/null | openssl x509 -noout -dates -issuer
```
**Solutions**:
**1. Certificate Not Ready**
```bash
# Check certificate status
kubectl get certificate -n <namespace>
# If not "True", check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager
```
**2. DNS Challenge Failed**
```bash
# Check ClusterIssuer
kubectl get clusterissuer
# Verify Cloudflare API token
kubectl get secret cloudflare-api-token-secret -n cert-manager
# Check challenge status
kubectl get challenge -A
```
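If a challenge is stuck, describing it usually shows the exact failure reason, and a DNS lookup confirms whether the ACME TXT record ever propagated (a sketch; names and domain are placeholders):
```bash
# Inspect a failing challenge in detail
kubectl describe challenge <challenge-name> -n <namespace>
# Verify the ACME TXT record is visible on public DNS
dig TXT _acme-challenge.domain.com +short
```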
**3. Manual Certificate Creation**
```bash
# If cert-manager fails, use acme.sh
export CF_Token="<CLOUDFLARE_TOKEN>"
~/.acme.sh/acme.sh --issue --dns dns_cf -d "*.domain.com" -d "domain.com"
# Create Kubernetes secret
kubectl create secret tls wildcard-tls \
  --cert=~/.acme.sh/*.domain.com_ecc/fullchain.cer \
  --key=~/.acme.sh/*.domain.com_ecc/*.domain.com.key \
  -n <namespace>
```
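To switch an existing Ingress over to the manually created secret, one option (a sketch; assumes the Ingress already defines a `spec.tls` entry) is a JSON patch:
```bash
# Point the Ingress at the new wildcard secret
kubectl patch ingress <name> -n <namespace> --type=json \
  -p='[{"op":"replace","path":"/spec/tls/0/secretName","value":"wildcard-tls"}]'
```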
### LoadBalancer Stuck in Pending
**Symptoms**:
- Ingress EXTERNAL-IP shows `<pending>`
- Cannot access services externally
**Diagnosis**:
```bash
# Check MetalLB
kubectl get pods -n metallb-system
# Check IPAddressPool
kubectl get ipaddresspool -A
# Check service
kubectl describe svc -n ingress-nginx nginx-ingress-lb-custom
```
**Solutions**:
**1. MetalLB Not Running**
```bash
# Check MetalLB pods
kubectl get pods -n metallb-system
# Restart if needed
kubectl rollout restart deployment -n metallb-system
```
**2. IP Pool Exhausted**
```bash
# Check IP pool configuration
kubectl get ipaddresspool -A -o yaml
# Check allocated IPs
kubectl get svc -A -o wide | grep LoadBalancer
```
**3. Annotation Error**
```bash
# Check MetalLB annotation
kubectl get svc <name> -n <namespace> -o yaml | grep metallb
```
Correct format:
```yaml
metadata:
  annotations:
    metallb.universe.tf/loadBalancerIPs: "<external-ip>"
```
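The annotation can also be applied or corrected in place without editing the full manifest (the IP is a placeholder):
```bash
# Set the MetalLB annotation directly on the service
kubectl annotate svc <name> -n <namespace> \
  metallb.universe.tf/loadBalancerIPs="<external-ip>" --overwrite
```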
### Ingress Returns 503 Backend Unavailable
**Symptoms**:
- NGINX Ingress returns 503
- Backend service is running
**Diagnosis**:
```bash
# Check ingress
kubectl describe ingress <name> -n <namespace>
# Check NGINX logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Check backend service
kubectl get svc <backend-service> -n <namespace>
```
**Solutions**:
**1. Wrong Backend Service**
```bash
# Verify the ingress points to the gateway service, not directly to the application
kubectl get ingress <name> -n <namespace> -o yaml | grep -A 5 backend
```
The backend should be:
```yaml
backend:
  service:
    name: <gateway-deployment>-gateway
    port:
      number: 80
```
**2. Service Port Mismatch**
```bash
# Check service ports
kubectl get svc <gateway-deployment>-gateway -n <namespace>
# Ingress should point to port 80, not 9080
```
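To see exactly which ports the gateway service exposes, and what they target, a jsonpath query helps:
```bash
# List name, port, and targetPort for every gateway service port
kubectl get svc <gateway-deployment>-gateway -n <namespace> \
  -o jsonpath='{range .spec.ports[*]}{.name}{": "}{.port}{" -> "}{.targetPort}{"\n"}{end}'
```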
## Control Plane Issues
### Dashboard Not Accessible
**Symptoms**:
- Cannot access https://<dashboard-url>
- Connection timeout or refused
**Diagnosis**:
```bash
# Check dashboard pod
kubectl get pods -n <namespace> -l app=<dashboard-app>
# Check dashboard service
kubectl get svc -n <namespace> <dashboard-service>
# Check ingress
kubectl get ingress -n <namespace> <dashboard-ingress>
```
**Solutions**:
**1. Pod Not Running**
```bash
# Check pod status
kubectl get pods -n <namespace> -l app=<dashboard-app>
# View logs
kubectl logs -n <namespace> -l app=<dashboard-app>
```
**2. Port Forward as Workaround**
```bash
kubectl port-forward -n <namespace> svc/<dashboard-service> 7080:7080
# Access at http://localhost:7080
```
### PostgreSQL Connection Failed
**Symptoms**:
- Dashboard/Portal shows database errors
- Logs show "connection refused" to PostgreSQL
**Diagnosis**:
```bash
# Check PostgreSQL pod
kubectl get pods -n <namespace> -l app=postgresql
# Check PostgreSQL service
kubectl get svc -n <namespace> postgresql
# Test connection from dashboard pod
kubectl exec -n <namespace> -it <dashboard-pod> -- psql -h postgresql -U <db-user> -d <db-name>
```
**Solutions**:
**1. PostgreSQL Pod Not Running**
```bash
# Check pod status
kubectl get pods -n <namespace> postgresql-0
# View logs
kubectl logs -n <namespace> postgresql-0
```
**2. Credentials Mismatch**
```bash
# Check credentials in secret
kubectl get secret postgresql -n <namespace> -o jsonpath='{.data.postgres-password}' | base64 -d
# Compare with DSN in dashboard config
kubectl get configmap <dashboard-configmap> -n <namespace> -o yaml | grep dsn
```
**3. Storage Issues**
```bash
# Check PVC
kubectl get pvc -n <namespace> data-postgresql-0
# If storage full, expand PVC (if storage class supports it)
kubectl patch pvc data-postgresql-0 -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```
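Before expanding the volume, it can help to confirm the disk is actually full from inside the pod (assuming the image ships standard coreutils):
```bash
# Check disk usage inside the PostgreSQL pod
kubectl exec -n <namespace> postgresql-0 -- df -h
```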
## Application Issues
### Image Pull Errors
**Error**: `ImagePullBackOff` or `ErrImagePull`
**Diagnosis**:
```bash
# Check pod status
kubectl get pods -n <namespace>
# Describe pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
# Check image name
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'
```
**Solutions**:
**1. Registry Authentication**
```bash
# Create registry secret
kubectl create secret docker-registry registry-secret \
  --docker-server=<registry-url> \
  --docker-username=<USERNAME> \
  --docker-password=<TOKEN> \
  -n <namespace>
```
Then reference the secret in the deployment:
```yaml
spec:
  template:
    spec:
      imagePullSecrets:
        - name: registry-secret
```
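To verify the secret contains the registry and credentials you expect (the output includes a base64-encoded auth string, so treat it as sensitive):
```bash
# Decode the .dockerconfigjson payload of the pull secret
kubectl get secret registry-secret -n <namespace> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```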
**2. Image Does Not Exist**
```bash
# Verify image exists
docker pull <registry-url>/web:main
# Check available tags via Gitea UI or API
curl -u <username>:<token> https://<registry-url>/api/v1/packages/demos
```
**3. Wrong Image Name**
```bash
# Correct format:
<registry-url>/web:main
# NOT:
<registry-url>:main # Missing /web
```
### Application Crashing
**Symptoms**:
- Pods in `CrashLoopBackOff`
- Application logs show errors
**Diagnosis**:
```bash
# Check pod logs
kubectl logs -n <namespace> <pod-name>
# Check previous pod logs (if restarted)
kubectl logs -n <namespace> <pod-name> --previous
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
**Solutions**:
**1. Port Mismatch**
```bash
# Verify app runs on correct port
# Check Dockerfile CMD or deployment env vars
# Common issue: App runs on 8000, container expects 3000
```
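A quick way to spot the mismatch is to compare the port the Deployment declares with the Service's targetPort (a sketch; resource names are placeholders):
```bash
# containerPort declared in the Deployment
kubectl get deployment <name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].ports[*].containerPort}{"\n"}'
# port -> targetPort mapping on the Service
kubectl get svc <service-name> -n <namespace> \
  -o jsonpath='{range .spec.ports[*]}{.port}{" -> "}{.targetPort}{"\n"}{end}'
```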
**2. Missing Dependencies**
```bash
# Check application logs for import errors
kubectl logs -n <namespace> <pod-name>
# Rebuild image with correct requirements.txt
```
**3. Resource Limits**
```bash
# Check if OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"
```
If the memory limit is too low, increase it:
```yaml
resources:
  limits:
    memory: 512Mi   # Increased from 256Mi
```
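The limit can also be raised without editing the manifest (values here are only examples):
```bash
# Update resource limits and requests in place
kubectl set resources deployment/<name> -n <namespace> \
  --limits=memory=512Mi --requests=memory=256Mi
```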
## CLI / Configuration Issues
### CLI Sync Fails
**Error**: `failed to sync configuration` or authentication errors
**Diagnosis**:
```bash
# Test CLI connection
<cli-tool> ping \
  --backend <namespace> \
  --server https://<dashboard-url> \
  --token <TOKEN> \
  --tls-skip-verify
# Validate configuration file
<cli-tool> validate -f config.yaml
```
**Solutions**:
**1. Invalid Token**
```bash
# Generate new token in Dashboard
# User → API Tokens → Generate Token
# Test with new token
<cli-tool> sync -f config.yaml --server <URL> --token <NEW_TOKEN>
```
**2. YAML Syntax Error**
```bash
# Validate YAML
<cli-tool> validate -f config.yaml
# Or use yq/yamllint
yq eval '.' config.yaml
```
**3. SSL Certificate Error**
```bash
# Use --tls-skip-verify flag (for self-signed certs)
<cli-tool> sync -f config.yaml --server <URL> --token <TOKEN> --tls-skip-verify
```
### Configuration Not Applied
**Symptoms**:
- CLI sync succeeds but changes not visible
- Routes not working as expected
**Diagnosis**:
```bash
# Dump current configuration
<cli-tool> dump --backend <namespace> --server <URL> --token <TOKEN> > current.yaml
# Compare with expected
diff config.yaml current.yaml
```
**Solutions**:
**1. Routes Not Published**
- Synced routes are NOT active until published
- Publish via Dashboard UI
**2. Wrong Gateway Group**
```bash
# Specify correct gateway group
<cli-tool> sync -f config.yaml --gateway-group default
```
**3. Cache Issue**
```bash
# Restart gateway pods to force reload
kubectl rollout restart deployment/<gateway-deployment> -n <namespace>
```
## Useful Debugging Commands
### Check All Resources
```bash
# All resources in namespace
kubectl get all -n <namespace>
# Wide output with more details
kubectl get all -n <namespace> -o wide
# All resource types including configmaps, secrets
kubectl get all,cm,secret,ingress,pvc -n <namespace>
```
### Log Collection
```bash
# All gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --all-containers=true --tail=200
# Dashboard logs
kubectl logs -n <namespace> -l app=<dashboard-app> --tail=100
# Stream logs in real-time
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway -f
```
### Network Testing
```bash
# Test from within cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- sh
# Inside pod:
curl http://<gateway-deployment>-gateway.<namespace>.svc.cluster.local
curl http://web-service.<namespace>.svc.cluster.local
```
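For scripting, the same check works as a one-shot pod that prints only the status code (a sketch):
```bash
# Non-interactive, one-shot variant of the in-cluster test
kubectl run curl-test --rm -i --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w "%{http_code}\n" \
  http://<gateway-deployment>-gateway.<namespace>.svc.cluster.local
```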
### Performance Analysis
```bash
# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes
# Describe for resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
```
---
*Comprehensive troubleshooting guide for API Gateway infrastructure deployment.*