
Troubleshooting Guide

Common issues and solutions for API Gateway deployment.

Gateway Issues

Gateway Pods Not Starting

Symptoms:

  • Gateway pods in CrashLoopBackOff or Error state
  • Pods continuously restarting

Diagnosis:

# Check pod status
kubectl get pods -n <namespace> -l app.kubernetes.io/name=gateway

# View pod logs
kubectl logs -n <namespace> <gateway-pod-name>

# Describe pod for events
kubectl describe pod -n <namespace> <gateway-pod-name>

Common Causes & Solutions:

1. Configuration Store Connection Failure

# Check Data Plane Manager is running
kubectl get pods -n <namespace> -l app=dp-manager

# Verify configuration endpoint
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml | grep endpoint

# Expected: Configuration store endpoint URL

2. TLS Certificate Issues

# Verify TLS secret exists
kubectl get secret <gateway-tls-secret> -n <namespace>

# Check certificate validity
kubectl get secret <gateway-tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

3. Configuration Error

# Check gateway ConfigMap
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml

# Validate YAML syntax
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml | yq eval '.'

Gateway Returns 404 for All Routes

Symptoms:

  • All requests return HTTP 404
  • Routes configured but not working

Diagnosis:

# Check if routes are published in Dashboard
# Navigate: Services → <service> → Routes
# Verify: Route status shows "Published"

# Check gateway logs for route loading
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --tail=100 | grep -i route

Solutions:

1. Routes Not Published

  • Routes synced via CLI are NOT active until published
  • Publish each route in Dashboard:
    • Services → Select service → Routes tab
    • Click "Publish" for each route
    • Select appropriate gateway group
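Once the routes are published, a request through the gateway confirms they are loaded (the host and IP below are placeholders consistent with the examples elsewhere in this guide):

```shell
# Send a test request with the route's Host header;
# any response other than 404 indicates the route is now active.
curl -i -H "Host: app.domain.com" http://<gateway-ip>/
```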

2. Wrong Gateway Group

# Verify gateway group
kubectl get deployment <gateway-deployment> -n <namespace> -o yaml | grep GATEWAY_GROUP

# Expected: Configured group name

3. Host Header Mismatch

# Test with correct Host header
curl -H "Host: app.domain.com" http://<gateway-ip>/

# Check route configuration
<cli-tool> dump --server <dashboard-url> --token <TOKEN>

Gateway Service Unavailable (503)

Symptoms:

  • Requests return HTTP 503
  • Gateway is running but can't reach backends

Diagnosis:

# Check backend service exists
kubectl get svc -n <namespace> <backend-service-name>

# Check backend endpoints
kubectl get endpoints -n <namespace> <backend-service-name>

# Check gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --tail=100 | grep -i "503\|upstream"

Solutions:

1. Backend Service Not Found

# Verify service exists
kubectl get svc -n <namespace>

# Check service name in route config matches
<cli-tool> dump --server <URL> --token <TOKEN> | grep -A 5 upstream

2. No Healthy Endpoints

# Check if pods are running
kubectl get pods -n <namespace> -l app=<backend-app>

# Verify endpoints exist
kubectl get endpoints -n <namespace> <service-name>

# If empty, check service selector
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A 3 selector
kubectl get pods -n <namespace> --show-labels | grep <label>

3. Service Discovery Not Working

See the Service Discovery Issues section below for diagnosis and fixes.

Service Discovery Issues

Service Registry Not Found

Error: service registry not found or discovery failed

Diagnosis:

# Check service registry in Dashboard
# Navigate: Settings → Service Registry
# Verify: Status shows "Connected" or "Healthy"

# Check gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway | grep -i discovery

Solutions:

1. Service Registry Not Configured

  • Dashboard → Settings → Service Registry → Add Service Registry
  • Type: Kubernetes
  • Configure:
    Name: kubernetes-cluster
    API Server: https://kubernetes.default.svc.cluster.local:443
    Token Path: /var/run/secrets/kubernetes.io/serviceaccount/token
    
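Before suspecting the registry configuration itself, you can verify in-cluster API server access from inside a gateway pod. The paths below are the standard in-cluster service-account mount locations; adjust them if your deployment differs:

```shell
# Inside a gateway pod: query the API server with the mounted token.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc.cluster.local:443/version
# A JSON version object confirms the token and CA certificate are valid.
```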

2. RBAC Permissions Missing

# Check permissions
kubectl auth can-i list endpoints --as=system:serviceaccount:<namespace>:default

# If "no", create ClusterRoleBinding
kubectl create clusterrolebinding gateway-discovery \
  --clusterrole=view \
  --serviceaccount=<namespace>:default

3. Service Port Not Named

# Check service definition
kubectl get svc <service-name> -n <namespace> -o yaml

# Port MUST have a name:
ports:
- port: 80
  targetPort: 8000
  name: http    # ← Required for service discovery
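If the port is unnamed, it can be patched in place instead of re-applying the full manifest (this sketch assumes the port to name is the first entry in spec.ports):

```shell
# Add a name to the first service port via a JSON patch.
kubectl patch svc <service-name> -n <namespace> --type=json \
  -p '[{"op":"add","path":"/spec/ports/0/name","value":"http"}]'
```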

Endpoints Not Discovered

Symptoms:

  • Service discovery configured but endpoints not updating
  • Scaling pods doesn't update gateway

Diagnosis:

# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>

# Scale pods and verify endpoints update
kubectl scale deployment <name> -n <namespace> --replicas=5
kubectl get endpoints -n <namespace> <service-name>

# Check gateway discovers endpoints
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway | grep -i endpoint

Solutions:

1. Check Service Registry Connection

# In Dashboard, verify registry status
# Settings → Service Registry → kubernetes-cluster
# Status should be "Connected"

2. Verify Service Name Format

# Format: <namespace>/<service-name>:<port-name>
upstream:
  discovery_type: kubernetes
  service_name: <namespace>/web-service:http
  # NOT: web-service or web-service.<namespace>.svc.cluster.local

3. Restart Gateway Pods

kubectl rollout restart deployment/<gateway-deployment> -n <namespace>

Ingress & Certificate Issues

Certificate Not Trusted / Invalid

Symptoms:

  • Browser shows "Not Secure" warning
  • Certificate errors in logs

Diagnosis:

# Check certificate
kubectl get certificate -n <namespace>

# Describe certificate for errors
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=50

# Test certificate
openssl s_client -connect demo.domain.com:443 -servername demo.domain.com < /dev/null 2>/dev/null | openssl x509 -noout -dates -issuer

Solutions:

1. Certificate Not Ready

# Check certificate status
kubectl get certificate -n <namespace>

# If not "True", check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager

2. DNS Challenge Failed

# Check ClusterIssuer
kubectl get clusterissuer

# Verify Cloudflare API token
kubectl get secret cloudflare-api-token-secret -n cert-manager

# Check challenge status
kubectl get challenge -A

3. Manual Certificate Creation

# If cert-manager fails, use acme.sh
export CF_Token="<CLOUDFLARE_TOKEN>"
~/.acme.sh/acme.sh --issue --dns dns_cf -d "*.domain.com" -d "domain.com"

# Create Kubernetes secret
# Quote the paths: acme.sh stores wildcard certs in a directory literally
# named *.domain.com_ecc; unquoted, the * would be glob-expanded by the shell.
kubectl create secret tls wildcard-tls \
  --cert="$HOME/.acme.sh/*.domain.com_ecc/fullchain.cer" \
  --key="$HOME/.acme.sh/*.domain.com_ecc/*.domain.com.key" \
  -n <namespace>

LoadBalancer Stuck in Pending

Symptoms:

  • Ingress EXTERNAL-IP shows <pending>
  • Cannot access services externally

Diagnosis:

# Check MetalLB
kubectl get pods -n metallb-system

# Check IPAddressPool
kubectl get ipaddresspool -A

# Check service
kubectl describe svc -n ingress-nginx nginx-ingress-lb-custom

Solutions:

1. MetalLB Not Running

# Check MetalLB pods
kubectl get pods -n metallb-system

# Restart if needed
kubectl rollout restart deployment -n metallb-system

2. IP Pool Exhausted

# Check IP pool configuration
kubectl get ipaddresspool -A -o yaml

# Check allocated IPs
kubectl get svc -A -o wide | grep LoadBalancer

3. Annotation Error

# Check MetalLB annotation
kubectl get svc <name> -n <namespace> -o yaml | grep metallb

# Correct format:
metadata:
  annotations:
    metallb.universe.tf/loadBalancerIPs: "<external-ip>"
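After correcting the annotation, the assigned address can be read straight from the service status:

```shell
# Empty output means MetalLB has not (yet) allocated an address.
kubectl get svc <name> -n <namespace> \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```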

Ingress Returns 503 Backend Unavailable

Symptoms:

  • NGINX Ingress returns 503
  • Backend service is running

Diagnosis:

# Check ingress
kubectl describe ingress <name> -n <namespace>

# Check NGINX logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Check backend service
kubectl get svc <backend-service> -n <namespace>

Solutions:

1. Wrong Backend Service

# Verify the ingress points to the gateway service, not the application
kubectl get ingress <name> -n <namespace> -o yaml | grep -A 5 backend

# Should be:
backend:
  service:
    name: <gateway-deployment>-gateway
    port:
      number: 80

2. Service Port Mismatch

# Check service ports
kubectl get svc <gateway-deployment>-gateway -n <namespace>

# Ingress should point to port 80, not 9080

Control Plane Issues

Dashboard Not Accessible

Symptoms:

  • Cannot access the Dashboard URL over HTTPS
  • Connection timeout or refused

Diagnosis:

# Check dashboard pod
kubectl get pods -n <namespace> -l app=<namespace>3-dashboard

# Check dashboard service
kubectl get svc -n <namespace> <namespace>3-0-1759339083-dashboard

# Check ingress
kubectl get ingress -n <namespace> <namespace>3-0-1759339083-dashboard

Solutions:

1. Pod Not Running

# Check pod status
kubectl get pods -n <namespace> -l app=<namespace>3-dashboard

# View logs
kubectl logs -n <namespace> -l app=<namespace>3-dashboard

2. Port Forward as Workaround

kubectl port-forward -n <namespace> svc/<namespace>3-0-1759339083-dashboard 7080:7080

# Access at http://localhost:7080

PostgreSQL Connection Failed

Symptoms:

  • Dashboard/Portal shows database errors
  • Logs show "connection refused" to PostgreSQL

Diagnosis:

# Check PostgreSQL pod
kubectl get pods -n <namespace> -l app=postgresql

# Check PostgreSQL service
kubectl get svc -n <namespace> postgresql

# Test connection from dashboard pod
kubectl exec -n <namespace> -it <dashboard-pod> -- psql -h postgresql -U <namespace> -d <namespace>

Solutions:

1. PostgreSQL Pod Not Running

# Check pod status
kubectl get pods -n <namespace> postgresql-0

# View logs
kubectl logs -n <namespace> postgresql-0

2. Credentials Mismatch

# Check credentials in secret
kubectl get secret postgresql -n <namespace> -o jsonpath='{.data.postgres-password}' | base64 -d

# Compare with DSN in dashboard config
kubectl get configmap <namespace>3-0-1759339083-dashboard-config -n <namespace> -o yaml | grep dsn

3. Storage Issues

# Check PVC
kubectl get pvc -n <namespace> data-postgresql-0

# If storage full, expand PVC (if storage class supports it)
kubectl patch pvc data-postgresql-0 -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

Application Issues

Image Pull Errors

Error: ImagePullBackOff or ErrImagePull

Diagnosis:

# Check pod status
kubectl get pods -n <namespace>

# Describe pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

# Check image name
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'

Solutions:

1. Registry Authentication

# Create registry secret
kubectl create secret docker-registry registry-secret \
  --docker-server=<registry-url> \
  --docker-username=<USERNAME> \
  --docker-password=<TOKEN> \
  -n <namespace>

# Add to deployment
spec:
  template:
    spec:
      imagePullSecrets:
      - name: registry-secret

2. Image Does Not Exist

# Verify image exists
docker pull <registry-url>/web:main

# Check available tags via Gitea UI or API
curl -u <username>:<token> https://<registry-url>/api/v1/packages/demos

3. Wrong Image Name

# Correct format:
<registry-url>/web:main

# NOT:
<registry-url>:main  # Missing /web

Application Crashing

Symptoms:

  • Pods in CrashLoopBackOff
  • Application logs show errors

Diagnosis:

# Check pod logs
kubectl logs -n <namespace> <pod-name>

# Check previous pod logs (if restarted)
kubectl logs -n <namespace> <pod-name> --previous

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

Solutions:

1. Port Mismatch

# Verify app runs on correct port
# Check Dockerfile CMD or deployment env vars

# Common issue: App runs on 8000, container expects 3000
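One way to compare the two is to read the declared containerPort from the pod spec and then inspect listening sockets inside the container (the second command assumes the image ships a shell plus ss or netstat, which not all minimal images do):

```shell
# Port declared in the pod spec:
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[0].ports[0].containerPort}'

# Ports the app actually listens on inside the container:
kubectl exec -n <namespace> <pod-name> -- sh -c 'ss -tln || netstat -tln'
```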

2. Missing Dependencies

# Check application logs for import errors
kubectl logs -n <namespace> <pod-name>

# Rebuild image with correct requirements.txt
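A minimal rebuild-and-redeploy loop, reusing the image tag from the examples above (<app-deployment> is a hypothetical placeholder for your deployment name):

```shell
# Rebuild with the fixed dependency list, push, and restart the deployment.
docker build -t <registry-url>/web:main .
docker push <registry-url>/web:main
kubectl rollout restart deployment/<app-deployment> -n <namespace>
```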

3. Resource Limits

# Check if OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"

# If memory limit too low, increase:
resources:
  limits:
    memory: 512Mi  # Increase from 256Mi
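The limit can also be raised without editing the manifest, using kubectl set resources (this triggers a rolling restart of the deployment):

```shell
# Raise the container memory limit on the deployment in place.
kubectl set resources deployment/<name> -n <namespace> --limits=memory=512Mi
```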

CLI / Configuration Issues

CLI Sync Fails

Error: failed to sync configuration or authentication errors

Diagnosis:

# Test CLI connection
<cli-tool> ping \
  --backend <namespace> \
  --server https://<dashboard-url> \
  --token <TOKEN> \
  --tls-skip-verify

# Validate configuration file
<cli-tool> validate -f config.yaml

Solutions:

1. Invalid Token

# Generate new token in Dashboard
# User → API Tokens → Generate Token

# Test with new token
<cli-tool> sync -f config.yaml --server <URL> --token <NEW_TOKEN>

2. YAML Syntax Error

# Validate YAML
<cli-tool> validate -f config.yaml

# Or use yq/yamllint
yq eval '.' config.yaml

3. SSL Certificate Error

# Use --tls-skip-verify flag (for self-signed certs)
<cli-tool> sync -f config.yaml --server <URL> --token <TOKEN> --tls-skip-verify

Configuration Not Applied

Symptoms:

  • CLI sync succeeds but changes not visible
  • Routes not working as expected

Diagnosis:

# Dump current configuration
<cli-tool> dump --backend <namespace> --server <URL> --token <TOKEN> > current.yaml

# Compare with expected
diff config.yaml current.yaml

Solutions:

1. Routes Not Published

  • Synced routes are NOT active until published
  • Publish via Dashboard UI

2. Wrong Gateway Group

# Specify correct gateway group
<cli-tool> sync -f config.yaml --gateway-group default

3. Cache Issue

# Restart gateway pods to force reload
kubectl rollout restart deployment/<gateway-deployment> -n <namespace>

Useful Debugging Commands

Check All Resources

# All resources in namespace
kubectl get all -n <namespace>

# Wide output with more details
kubectl get all -n <namespace> -o wide

# All resource types including configmaps, secrets
kubectl get all,cm,secret,ingress,pvc -n <namespace>

Log Collection

# All gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --all-containers=true --tail=200

# Dashboard logs
kubectl logs -n <namespace> -l app=<namespace>3-dashboard --tail=100

# Stream logs in real-time
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway -f

Network Testing

# Test from within cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- sh

# Inside pod:
curl http://<gateway-deployment>-gateway.<namespace>.svc.cluster.local
curl http://web-service.<namespace>.svc.cluster.local

Performance Analysis

# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes

# Describe for resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
