Troubleshooting Guide
Common issues and solutions for API Gateway deployment.
Gateway Issues
Gateway Pods Not Starting
Symptoms:
- Gateway pods in CrashLoopBackOff or Error state
- Pods continuously restarting
Diagnosis:
# Check pod status
kubectl get pods -n <namespace> -l app.kubernetes.io/name=gateway
# View pod logs
kubectl logs -n <namespace> <gateway-pod-name>
# Describe pod for events
kubectl describe pod -n <namespace> <gateway-pod-name>
Common Causes & Solutions:
1. Configuration Store Connection Failure
# Check Data Plane Manager is running
kubectl get pods -n <namespace> -l app=dp-manager
# Verify configuration endpoint
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml | grep endpoint
# Expected: Configuration store endpoint URL
2. TLS Certificate Issues
# Verify TLS secret exists
kubectl get secret <gateway-tls-secret> -n <namespace>
# Check certificate validity
kubectl get secret <gateway-tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
3. Configuration Error
# Check gateway ConfigMap
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml
# Validate YAML syntax
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml | yq eval '.'
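For the TLS certificate check above, the raw dates can be turned into a single "days remaining" number, which is easier to compare against a threshold. This is a minimal sketch, not part of the gateway tooling: the helper name `days_until_expiry` is hypothetical, and it assumes GNU `date` (the `-d` option) is available.

```shell
# days_until_expiry (hypothetical helper): converts the "notAfter=..." line
# printed by `openssl x509 -noout -enddate` into whole days remaining.
# Assumes GNU date (for -d). An optional second argument overrides "now"
# as a Unix timestamp, which is mainly useful for testing.
days_until_expiry() {
  not_after=${1#notAfter=}                      # strip the "notAfter=" prefix
  now=${2:-$(date +%s)}
  expiry=$(date -d "$not_after" +%s) || return 2
  echo $(( (expiry - now) / 86400 ))
}

# Example usage against a live cluster:
# end=$(kubectl get secret <gateway-tls-secret> -n <namespace> \
#   -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate)
# days_until_expiry "$end"
```

A result of zero or below suggests the certificate has expired or expires within a day.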
Gateway Returns 404 for All Routes
Symptoms:
- All requests return HTTP 404
- Routes configured but not working
Diagnosis:
# Check if routes are published in Dashboard
# Navigate: Services → <service> → Routes
# Verify: Route status shows "Published"
# Check gateway logs for route loading
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --tail=100 | grep -i route
Solutions:
1. Routes Not Published
- Routes synced via CLI are NOT active until published
- Publish each route in Dashboard:
- Services → Select service → Routes tab
- Click "Publish" for each route
- Select appropriate gateway group
2. Wrong Gateway Group
# Verify gateway group
kubectl get deployment <gateway-deployment> -n <namespace> -o yaml | grep GATEWAY_GROUP
# Expected: Configured group name
3. Host Header Mismatch
# Test with correct Host header
curl -H "Host: app.domain.com" http://<gateway-ip>/
# Check route configuration
<cli-tool> dump --server <dashboard-url> --token <TOKEN>
Gateway Service Unavailable (503)
Symptoms:
- Requests return HTTP 503
- Gateway is running but can't reach backends
Diagnosis:
# Check backend service exists
kubectl get svc -n <namespace> <backend-service-name>
# Check backend endpoints
kubectl get endpoints -n <namespace> <backend-service-name>
# Check gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --tail=100 | grep -i "503\|upstream"
Solutions:
1. Backend Service Not Found
# Verify service exists
kubectl get svc -n <namespace>
# Check service name in route config matches
<cli-tool> dump --server <URL> --token <TOKEN> | grep -A 5 upstream
2. No Healthy Endpoints
# Check if pods are running
kubectl get pods -n <namespace> -l app=<backend-app>
# Verify endpoints exist
kubectl get endpoints -n <namespace> <service-name>
# If empty, check service selector
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A 3 selector
kubectl get pods -n <namespace> --show-labels | grep <label>
3. Service Discovery Not Working
- See Service Discovery Issues below
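For the No Healthy Endpoints case above, the selector-versus-labels comparison can be scripted rather than checked by eye. The sketch below is one possible approach; `selector_matches` is a hypothetical helper, and it assumes both inputs are comma-separated `key=value` lists, which is how `--show-labels` prints pod labels.

```shell
# selector_matches (hypothetical helper): returns 0 when every key=value pair
# in the selector (arg 1) also appears in the pod's label list (arg 2).
# Both arguments are comma-separated, e.g. "app=web,tier=frontend".
selector_matches() {
  selector=$1
  labels=",$2,"                 # pad with commas so whole pairs can be matched
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;           # pair present, keep checking
      *) IFS=$old_ifs; echo "missing label: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

# Example usage against a live cluster (the go-template prints "k=v,k=v,"):
# sel=$(kubectl get svc <service-name> -n <namespace> \
#   -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
# selector_matches "$sel" "<labels-from-show-labels>"
```

The first missing pair is printed, which usually points directly at the label typo in either the Service or the Deployment.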
Service Discovery Issues
Service Registry Not Found
Error: service registry not found or discovery failed
Diagnosis:
# Check service registry in Dashboard
# Navigate: Settings → Service Registry
# Verify: Status shows "Connected" or "Healthy"
# Check gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway | grep -i discovery
Solutions:
1. Service Registry Not Configured
- Dashboard → Settings → Service Registry → Add Service Registry
- Type: Kubernetes
- Configure:
Name: kubernetes-cluster
API Server: https://kubernetes.default.svc.cluster.local:443
Token Path: /var/run/secrets/kubernetes.io/serviceaccount/token
2. RBAC Permissions Missing
# Check permissions
kubectl auth can-i list endpoints --as=system:serviceaccount:<namespace>:default
# If "no", create ClusterRoleBinding
kubectl create clusterrolebinding gateway-discovery \
--clusterrole=view \
--serviceaccount=<namespace>:default
3. Service Port Not Named
# Check service definition
kubectl get svc <service-name> -n <namespace> -o yaml
# Port MUST have a name:
ports:
  - port: 80
    targetPort: 8000
    name: http # ← Required for service discovery
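To spot unnamed ports without reading the whole manifest, the port list can be filtered with a small script. This is a sketch under stated assumptions: `find_unnamed_ports` is a hypothetical helper, and it expects tab-separated `port<TAB>name` lines such as those produced by the jsonpath expression in the usage comment.

```shell
# find_unnamed_ports (hypothetical helper): reads "port<TAB>name" lines on
# stdin, prints every port whose name column is empty, and exits non-zero
# if at least one unnamed port was found.
find_unnamed_ports() {
  awk -F'\t' '
    $1 != "" && $2 == "" { print "unnamed port: " $1; bad = 1 }
    END { exit bad }
  '
}

# Example usage against a live cluster:
# kubectl get svc <service-name> -n <namespace> \
#   -o jsonpath='{range .spec.ports[*]}{.port}{"\t"}{.name}{"\n"}{end}' \
#   | find_unnamed_ports
```

The non-zero exit status makes the check usable in a CI step or pre-deploy script.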
Endpoints Not Discovered
Symptoms:
- Service discovery configured but endpoints not updating
- Scaling pods doesn't update gateway
Diagnosis:
# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>
# Scale pods and verify endpoints update
kubectl scale deployment <name> -n <namespace> --replicas=5
kubectl get endpoints -n <namespace> <service-name>
# Check gateway discovers endpoints
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway | grep -i endpoint
Solutions:
1. Check Service Registry Connection
# In Dashboard, verify registry status
# Settings → Service Registry → kubernetes-cluster
# Status should be "Connected"
2. Verify Service Name Format
# Format: <namespace>/<service-name>:<port-name>
upstream:
  discovery_type: kubernetes
  service_name: <namespace>/web-service:http
# NOT: web-service or web-service.<namespace>.svc.cluster.local
3. Restart Gateway Pods
kubectl rollout restart deployment/<gateway-deployment> -n <namespace>
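The service name format from solution 2 above can be checked before syncing a configuration. The following is a rough sketch: `is_valid_discovery_name` is a hypothetical helper, and the pattern only approximates Kubernetes naming rules (lowercase alphanumerics and hyphens), so treat a pass as "plausibly well-formed" rather than a guarantee.

```shell
# is_valid_discovery_name (hypothetical helper): returns 0 when the argument
# looks like <namespace>/<service-name>:<port-name>, and 1 otherwise.
# Each part is approximated as lowercase alphanumerics and hyphens that
# start and end with an alphanumeric character.
is_valid_discovery_name() {
  part='[a-z0-9]([-a-z0-9]*[a-z0-9])?'
  printf '%s\n' "$1" | grep -Eq "^${part}/${part}:${part}\$"
}

# Example usage:
# is_valid_discovery_name "default/web-service:http" && echo "format ok"
```

A bare service name or a cluster DNS name is rejected because it lacks the `/` and `:` separators.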
Ingress & Certificate Issues
Certificate Not Trusted / Invalid
Symptoms:
- Browser shows "Not Secure" warning
- Certificate errors in logs
Diagnosis:
# Check certificate
kubectl get certificate -n <namespace>
# Describe certificate for errors
kubectl describe certificate <cert-name> -n <namespace>
# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=50
# Test certificate
openssl s_client -connect demo.domain.com:443 -servername demo.domain.com < /dev/null 2>/dev/null | openssl x509 -noout -dates -issuer
Solutions:
1. Certificate Not Ready
# Check certificate status
kubectl get certificate -n <namespace>
# If not "True", check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager
2. DNS Challenge Failed
# Check ClusterIssuer
kubectl get clusterissuer
# Verify Cloudflare API token
kubectl get secret cloudflare-api-token-secret -n cert-manager
# Check challenge status
kubectl get challenge -A
3. Manual Certificate Creation
# If cert-manager fails, use acme.sh
export CF_Token="<CLOUDFLARE_TOKEN>"
~/.acme.sh/acme.sh --issue --dns dns_cf -d "*.domain.com" -d "domain.com"
# Create Kubernetes secret
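# Quote the paths: "~" is not expanded after "=", and the "*" is a literal
# part of the acme.sh directory and file names
kubectl create secret tls wildcard-tls \
  --cert="$HOME/.acme.sh/*.domain.com_ecc/fullchain.cer" \
  --key="$HOME/.acme.sh/*.domain.com_ecc/*.domain.com.key" \
  -n <namespace>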
LoadBalancer Stuck in Pending
Symptoms:
- Ingress EXTERNAL-IP shows <pending>
- Cannot access services externally
Diagnosis:
# Check MetalLB
kubectl get pods -n metallb-system
# Check IPAddressPool
kubectl get ipaddresspool -A
# Check service
kubectl describe svc -n ingress-nginx nginx-ingress-lb-custom
Solutions:
1. MetalLB Not Running
# Check MetalLB pods
kubectl get pods -n metallb-system
# Restart if needed
kubectl rollout restart deployment -n metallb-system
2. IP Pool Exhausted
# Check IP pool configuration
kubectl get ipaddresspool -A -o yaml
# Check allocated IPs
kubectl get svc -A -o wide | grep LoadBalancer
3. Annotation Error
# Check MetalLB annotation
kubectl get svc <name> -n <namespace> -o yaml | grep metallb
# Correct format:
metadata:
  annotations:
    metallb.universe.tf/loadBalancerIPs: "<external-ip>"
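When checking for an exhausted IP pool, it helps to list only the LoadBalancer services that are still waiting for an address. A minimal sketch follows; `list_pending_lb` is a hypothetical helper, and it assumes the default column layout of `kubectl get svc -A --no-headers` (namespace, name, type, cluster IP, external IP).

```shell
# list_pending_lb (hypothetical helper): reads `kubectl get svc -A --no-headers`
# output on stdin and prints namespace/name for each LoadBalancer service
# whose EXTERNAL-IP column is still <pending>.
list_pending_lb() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}

# Example usage against a live cluster:
# kubectl get svc -A --no-headers | list_pending_lb
```

If several services are pending while the pool range is small, exhaustion is a likely cause; a single pending service more often points to an annotation problem.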
Ingress Returns 503 Backend Unavailable
Symptoms:
- NGINX Ingress returns 503
- Backend service is running
Diagnosis:
# Check ingress
kubectl describe ingress <name> -n <namespace>
# Check NGINX logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# Check backend service
kubectl get svc <backend-service> -n <namespace>
Solutions:
1. Wrong Backend Service
# Verify ingress points to the gateway service, not the application
kubectl get ingress <name> -n <namespace> -o yaml | grep -A 5 backend
# Should be:
backend:
  service:
    name: <gateway-deployment>-gateway
    port:
      number: 80
2. Service Port Mismatch
# Check service ports
kubectl get svc <gateway-deployment>-gateway -n <namespace>
# Ingress should point to port 80, not 9080
Control Plane Issues
Dashboard Not Accessible
Symptoms:
- Cannot access https://
- Connection timeout or refused
Diagnosis:
# Check dashboard pod
kubectl get pods -n <namespace> -l app=<namespace>3-dashboard
# Check dashboard service
kubectl get svc -n <namespace> <namespace>3-0-1759339083-dashboard
# Check ingress
kubectl get ingress -n <namespace> <namespace>3-0-1759339083-dashboard
Solutions:
1. Pod Not Running
# Check pod status
kubectl get pods -n <namespace> -l app=<namespace>3-dashboard
# View logs
kubectl logs -n <namespace> -l app=<namespace>3-dashboard
2. Port Forward as Workaround
kubectl port-forward -n <namespace> svc/<namespace>3-0-1759339083-dashboard 7080:7080
# Access at http://localhost:7080
PostgreSQL Connection Failed
Symptoms:
- Dashboard/Portal shows database errors
- Logs show "connection refused" to PostgreSQL
Diagnosis:
# Check PostgreSQL pod
kubectl get pods -n <namespace> -l app=postgresql
# Check PostgreSQL service
kubectl get svc -n <namespace> postgresql
# Test connection from dashboard pod
kubectl exec -n <namespace> -it <dashboard-pod> -- psql -h postgresql -U <namespace> -d <namespace>
Solutions:
1. PostgreSQL Pod Not Running
# Check pod status
kubectl get pods -n <namespace> postgresql-0
# View logs
kubectl logs -n <namespace> postgresql-0
2. Credentials Mismatch
# Check credentials in secret
kubectl get secret postgresql -n <namespace> -o jsonpath='{.data.postgres-password}' | base64 -d
# Compare with DSN in dashboard config
kubectl get configmap <namespace>3-0-1759339083-dashboard-config -n <namespace> -o yaml | grep dsn
3. Storage Issues
# Check PVC
kubectl get pvc -n <namespace> data-postgresql-0
# If storage full, expand PVC (if storage class supports it)
kubectl patch pvc data-postgresql-0 -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
Application Issues
Image Pull Errors
Error: ImagePullBackOff or ErrImagePull
Diagnosis:
# Check pod status
kubectl get pods -n <namespace>
# Describe pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
# Check image name
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'
Solutions:
1. Registry Authentication
# Create registry secret
kubectl create secret docker-registry registry-secret \
--docker-server=<registry-url> \
--docker-username=<USERNAME> \
--docker-password=<TOKEN> \
-n <namespace>
# Add to deployment
spec:
  template:
    spec:
      imagePullSecrets:
        - name: registry-secret
2. Image Does Not Exist
# Verify image exists
docker pull <registry-url>/web:main
# Check available tags via Gitea UI or API
curl -u <username>:<token> https://<registry-url>/api/v1/packages/demos
3. Wrong Image Name
# Correct format:
<registry-url>/web:main
# NOT:
<registry-url>:main # Missing /web
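The naming rule above can be applied as a quick pre-deploy check. This sketch assumes images always come from a private registry, so a reference without a repository path after the registry host is treated as a mistake; `has_repo_path` is a hypothetical helper, not part of any tool mentioned here.

```shell
# has_repo_path (hypothetical helper): returns 0 when the image reference
# contains a "/" followed by a repository name, and 1 otherwise.
# Heuristic only: it does not validate the registry host or the tag.
has_repo_path() {
  case "$1" in
    */[a-z0-9]*) return 0 ;;
    *) echo "image reference has no repository path: $1"; return 1 ;;
  esac
}

# Example usage against a live cluster:
# img=$(kubectl get pod <pod-name> -n <namespace> \
#   -o jsonpath='{.spec.containers[0].image}')
# has_repo_path "$img"
```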
Application Crashing
Symptoms:
- Pods in CrashLoopBackOff
- Application logs show errors
Diagnosis:
# Check pod logs
kubectl logs -n <namespace> <pod-name>
# Check previous pod logs (if restarted)
kubectl logs -n <namespace> <pod-name> --previous
# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
Solutions:
1. Port Mismatch
# Verify app runs on correct port
# Check Dockerfile CMD or deployment env vars
# Common issue: App runs on 8000, container expects 3000
2. Missing Dependencies
# Check application logs for import errors
kubectl logs -n <namespace> <pod-name>
# Rebuild image with correct requirements.txt
3. Resource Limits
# Check if OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"
# If memory limit too low, increase:
resources:
  limits:
    memory: 512Mi # Increase from 256Mi
CLI / Configuration Issues
CLI Sync Fails
Error: failed to sync configuration or authentication errors
Diagnosis:
# Test CLI connection
<cli-tool> ping \
  --backend <namespace> \
  --server https://<dashboard-url> \
  --token <TOKEN> \
  --tls-skip-verify
# Validate configuration file
<cli-tool> validate -f config.yaml
Solutions:
1. Invalid Token
# Generate new token in Dashboard
# User → API Tokens → Generate Token
# Test with new token
<cli-tool> sync -f config.yaml --server <URL> --token <NEW_TOKEN>
2. YAML Syntax Error
# Validate YAML
<cli-tool> validate -f config.yaml
# Or use yq/yamllint
yq eval '.' config.yaml
3. SSL Certificate Error
# Use --tls-skip-verify flag (for self-signed certs)
<cli-tool> sync -f config.yaml --server <URL> --token <TOKEN> --tls-skip-verify
Configuration Not Applied
Symptoms:
- CLI sync succeeds but changes not visible
- Routes not working as expected
Diagnosis:
# Dump current configuration
<cli-tool> dump --backend <namespace> --server <URL> --token <TOKEN> > current.yaml
# Compare with expected
diff config.yaml current.yaml
Solutions:
1. Routes Not Published
- Synced routes are NOT active until published
- Publish via Dashboard UI
2. Wrong Gateway Group
# Specify correct gateway group
<cli-tool> sync -f config.yaml --gateway-group default
3. Cache Issue
# Restart gateway pods to force reload
kubectl rollout restart deployment/<gateway-deployment> -n <namespace>
Useful Debugging Commands
Check All Resources
# All resources in namespace
kubectl get all -n <namespace>
# Wide output with more details
kubectl get all -n <namespace> -o wide
# All resource types including configmaps, secrets
kubectl get all,cm,secret,ingress,pvc -n <namespace>
Log Collection
# All gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --all-containers=true --tail=200
# Dashboard logs
kubectl logs -n <namespace> -l app=<namespace>3-dashboard --tail=100
# Stream logs in real-time
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway -f
Network Testing
# Test from within cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- sh
# Inside pod:
curl http://<gateway-deployment>-gateway.<namespace>.svc.cluster.local
curl http://web-service.<namespace>.svc.cluster.local
Performance Analysis
# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes
# Describe for resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits