
Troubleshooting Guide

Common issues and solutions for API Gateway deployment.

Gateway Issues

Gateway Pods Not Starting

Symptoms:

  • Gateway pods in CrashLoopBackOff or Error state
  • Pods continuously restarting

Diagnosis:

# Check pod status
kubectl get pods -n <namespace> -l app.kubernetes.io/name=gateway

# View pod logs
kubectl logs -n <namespace> <gateway-pod-name>

# Describe pod for events
kubectl describe pod -n <namespace> <gateway-pod-name>

Common Causes & Solutions:

1. Configuration Store Connection Failure

# Check Data Plane Manager is running
kubectl get pods -n <namespace> -l app=dp-manager

# Verify configuration endpoint
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml | grep endpoint

# Expected: Configuration store endpoint URL

2. TLS Certificate Issues

# Verify TLS secret exists
kubectl get secret <gateway-tls-secret> -n <namespace>

# Check certificate validity
kubectl get secret <gateway-tls-secret> -n <namespace> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

3. Configuration Error

# Check gateway ConfigMap
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml

# Validate YAML syntax
kubectl get configmap <gateway-configmap> -n <namespace> -o yaml | yq eval '.'

Gateway Returns 404 for All Routes

Symptoms:

  • All requests return HTTP 404
  • Routes configured but not working

Diagnosis:

# Check if routes are published in Dashboard
# Navigate: Services → <service> → Routes
# Verify: Route status shows "Published"

# Check gateway logs for route loading
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --tail=100 | grep -i route

Solutions:

1. Routes Not Published

  • Routes synced via CLI are NOT active until published
  • Publish each route in Dashboard:
    • Services → Select service → Routes tab
    • Click "Publish" for each route
    • Select appropriate gateway group
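Once the routes are published, a request through the gateway confirms they are loaded (the host and IP below are placeholders consistent with the examples elsewhere in this guide):

```shell
# Send a test request with the route's Host header;
# any response other than 404 indicates the route is now active.
curl -i -H "Host: app.domain.com" http://<gateway-ip>/
```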

2. Wrong Gateway Group

# Verify gateway group
kubectl get deployment <gateway-deployment> -n <namespace> -o yaml | grep GATEWAY_GROUP

# Expected: Configured group name

3. Host Header Mismatch

# Test with correct Host header
curl -H "Host: app.domain.com" http://<gateway-ip>/

# Check route configuration
<cli-tool> dump --server <dashboard-url> --token <TOKEN>

Gateway Service Unavailable (503)

Symptoms:

  • Requests return HTTP 503
  • Gateway is running but can't reach backends

Diagnosis:

# Check backend service exists
kubectl get svc -n <namespace> <backend-service-name>

# Check backend endpoints
kubectl get endpoints -n <namespace> <backend-service-name>

# Check gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --tail=100 | grep -i "503\|upstream"

Solutions:

1. Backend Service Not Found

# Verify service exists
kubectl get svc -n <namespace>

# Check service name in route config matches
<cli-tool> dump --server <URL> --token <TOKEN> | grep -A 5 upstream

2. No Healthy Endpoints

# Check if pods are running
kubectl get pods -n <namespace> -l app=<backend-app>

# Verify endpoints exist
kubectl get endpoints -n <namespace> <service-name>

# If empty, check service selector
kubectl get svc <service-name> -n <namespace> -o yaml | grep -A 3 selector
kubectl get pods -n <namespace> --show-labels | grep <label>

3. Service Discovery Not Working

See the Service Discovery Issues section below for diagnosis and fixes.

Service Discovery Issues

Service Registry Not Found

Error: service registry not found or discovery failed

Diagnosis:

# Check service registry in Dashboard
# Navigate: Settings → Service Registry
# Verify: Status shows "Connected" or "Healthy"

# Check gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway | grep -i discovery

Solutions:

1. Service Registry Not Configured

  • Dashboard → Settings → Service Registry → Add Service Registry
  • Type: Kubernetes
  • Configure:
    Name: kubernetes-cluster
    API Server: https://kubernetes.default.svc.cluster.local:443
    Token Path: /var/run/secrets/kubernetes.io/serviceaccount/token
    
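Before suspecting the registry configuration itself, you can verify in-cluster API server access from inside a gateway pod. The paths below are the standard in-cluster service-account mount locations; adjust them if your deployment differs:

```shell
# Inside a gateway pod: query the API server with the mounted token.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sS --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $TOKEN" \
  https://kubernetes.default.svc.cluster.local:443/version
# A JSON version object confirms the token and CA certificate are valid.
```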

2. RBAC Permissions Missing

# Check permissions
kubectl auth can-i list endpoints --as=system:serviceaccount:<namespace>:default

# If "no", create ClusterRoleBinding
kubectl create clusterrolebinding gateway-discovery \
  --clusterrole=view \
  --serviceaccount=<namespace>:default

3. Service Port Not Named

# Check service definition
kubectl get svc <service-name> -n <namespace> -o yaml

# Port MUST have a name:
ports:
- port: 80
  targetPort: 8000
  name: http    # ← Required for service discovery
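If the port is unnamed, it can be patched in place instead of re-applying the full manifest (this sketch assumes the port to name is the first entry in spec.ports):

```shell
# Add a name to the first service port via a JSON patch.
kubectl patch svc <service-name> -n <namespace> --type=json \
  -p '[{"op":"add","path":"/spec/ports/0/name","value":"http"}]'
```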

Endpoints Not Discovered

Symptoms:

  • Service discovery configured but endpoints not updating
  • Scaling pods doesn't update gateway

Diagnosis:

# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>

# Scale pods and verify endpoints update
kubectl scale deployment <name> -n <namespace> --replicas=5
kubectl get endpoints -n <namespace> <service-name>

# Check gateway discovers endpoints
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway | grep -i endpoint

Solutions:

1. Check Service Registry Connection

# In Dashboard, verify registry status
# Settings → Service Registry → kubernetes-cluster
# Status should be "Connected"

2. Verify Service Name Format

# Format: <namespace>/<service-name>:<port-name>
upstream:
  discovery_type: kubernetes
  service_name: <namespace>/web-service:http
  # NOT: web-service or web-service.<namespace>.svc.cluster.local

3. Restart Gateway Pods

kubectl rollout restart deployment/<gateway-deployment> -n <namespace>

Ingress & Certificate Issues

Certificate Not Trusted / Invalid

Symptoms:

  • Browser shows "Not Secure" warning
  • Certificate errors in logs

Diagnosis:

# Check certificate
kubectl get certificate -n <namespace>

# Describe certificate for errors
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=50

# Test certificate
openssl s_client -connect demo.domain.com:443 -servername demo.domain.com < /dev/null 2>/dev/null | openssl x509 -noout -dates -issuer

Solutions:

1. Certificate Not Ready

# Check certificate status
kubectl get certificate -n <namespace>

# If not "True", check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager

2. DNS Challenge Failed

# Check ClusterIssuer
kubectl get clusterissuer

# Verify Cloudflare API token
kubectl get secret cloudflare-api-token-secret -n cert-manager

# Check challenge status
kubectl get challenge -A

3. Manual Certificate Creation

# If cert-manager fails, use acme.sh
export CF_Token="<CLOUDFLARE_TOKEN>"
~/.acme.sh/acme.sh --issue --dns dns_cf -d "*.domain.com" -d "domain.com"

# Create Kubernetes secret
# Quote the paths: acme.sh stores wildcard certs in a directory literally
# named *.domain.com_ecc; unquoted, the * would be glob-expanded by the shell.
kubectl create secret tls wildcard-tls \
  --cert="$HOME/.acme.sh/*.domain.com_ecc/fullchain.cer" \
  --key="$HOME/.acme.sh/*.domain.com_ecc/*.domain.com.key" \
  -n <namespace>

LoadBalancer Stuck in Pending

Symptoms:

  • Ingress EXTERNAL-IP shows <pending>
  • Cannot access services externally

Diagnosis:

# Check MetalLB
kubectl get pods -n metallb-system

# Check IPAddressPool
kubectl get ipaddresspool -A

# Check service
kubectl describe svc -n ingress-nginx nginx-ingress-lb-custom

Solutions:

1. MetalLB Not Running

# Check MetalLB pods
kubectl get pods -n metallb-system

# Restart if needed
kubectl rollout restart deployment -n metallb-system

2. IP Pool Exhausted

# Check IP pool configuration
kubectl get ipaddresspool -A -o yaml

# Check allocated IPs
kubectl get svc -A -o wide | grep LoadBalancer

3. Annotation Error

# Check MetalLB annotation
kubectl get svc <name> -n <namespace> -o yaml | grep metallb

# Correct format:
metadata:
  annotations:
    metallb.universe.tf/loadBalancerIPs: "<external-ip>"
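After correcting the annotation, the assigned address can be read straight from the service status:

```shell
# Empty output means MetalLB has not (yet) allocated an address.
kubectl get svc <name> -n <namespace> \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```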

Ingress Returns 503 Backend Unavailable

Symptoms:

  • NGINX Ingress returns 503
  • Backend service is running

Diagnosis:

# Check ingress
kubectl describe ingress <name> -n <namespace>

# Check NGINX logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Check backend service
kubectl get svc <backend-service> -n <namespace>

Solutions:

1. Wrong Backend Service

# Verify the ingress points to the gateway service, not the application
kubectl get ingress <name> -n <namespace> -o yaml | grep -A 5 backend

# Should be:
backend:
  service:
    name: <gateway-deployment>-gateway
    port:
      number: 80

2. Service Port Mismatch

# Check service ports
kubectl get svc <gateway-deployment>-gateway -n <namespace>

# Ingress should point to port 80, not 9080

Control Plane Issues

Dashboard Not Accessible

Symptoms:

  • Cannot access the Dashboard URL over HTTPS
  • Connection timeout or refused

Diagnosis:

# Check dashboard pod
kubectl get pods -n <namespace> -l app=<namespace>3-dashboard

# Check dashboard service
kubectl get svc -n <namespace> <namespace>3-0-1759339083-dashboard

# Check ingress
kubectl get ingress -n <namespace> <namespace>3-0-1759339083-dashboard

Solutions:

1. Pod Not Running

# Check pod status
kubectl get pods -n <namespace> -l app=<namespace>3-dashboard

# View logs
kubectl logs -n <namespace> -l app=<namespace>3-dashboard

2. Port Forward as Workaround

kubectl port-forward -n <namespace> svc/<namespace>3-0-1759339083-dashboard 7080:7080

# Access at http://localhost:7080

PostgreSQL Connection Failed

Symptoms:

  • Dashboard/Portal shows database errors
  • Logs show "connection refused" to PostgreSQL

Diagnosis:

# Check PostgreSQL pod
kubectl get pods -n <namespace> -l app=postgresql

# Check PostgreSQL service
kubectl get svc -n <namespace> postgresql

# Test connection from dashboard pod
kubectl exec -n <namespace> -it <dashboard-pod> -- psql -h postgresql -U <namespace> -d <namespace>

Solutions:

1. PostgreSQL Pod Not Running

# Check pod status
kubectl get pods -n <namespace> postgresql-0

# View logs
kubectl logs -n <namespace> postgresql-0

2. Credentials Mismatch

# Check credentials in secret
kubectl get secret postgresql -n <namespace> -o jsonpath='{.data.postgres-password}' | base64 -d

# Compare with DSN in dashboard config
kubectl get configmap <namespace>3-0-1759339083-dashboard-config -n <namespace> -o yaml | grep dsn

3. Storage Issues

# Check PVC
kubectl get pvc -n <namespace> data-postgresql-0

# If storage full, expand PVC (if storage class supports it)
kubectl patch pvc data-postgresql-0 -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

Application Issues

Image Pull Errors

Error: ImagePullBackOff or ErrImagePull

Diagnosis:

# Check pod status
kubectl get pods -n <namespace>

# Describe pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

# Check image name
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'

Solutions:

1. Registry Authentication

# Create registry secret
kubectl create secret docker-registry registry-secret \
  --docker-server=<registry-url> \
  --docker-username=<USERNAME> \
  --docker-password=<TOKEN> \
  -n <namespace>

# Add to deployment
spec:
  template:
    spec:
      imagePullSecrets:
      - name: registry-secret

2. Image Does Not Exist

# Verify image exists
docker pull <registry-url>/web:main

# Check available tags via Gitea UI or API
curl -u <username>:<token> https://<registry-url>/api/v1/packages/demos

3. Wrong Image Name

# Correct format:
<registry-url>/web:main

# NOT:
<registry-url>:main  # Missing /web

Application Crashing

Symptoms:

  • Pods in CrashLoopBackOff
  • Application logs show errors

Diagnosis:

# Check pod logs
kubectl logs -n <namespace> <pod-name>

# Check previous pod logs (if restarted)
kubectl logs -n <namespace> <pod-name> --previous

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

Solutions:

1. Port Mismatch

# Verify app runs on correct port
# Check Dockerfile CMD or deployment env vars

# Common issue: App runs on 8000, container expects 3000
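One way to compare the two is to read the declared containerPort from the pod spec and then inspect listening sockets inside the container (the second command assumes the image ships a shell plus ss or netstat, which not all minimal images do):

```shell
# Port declared in the pod spec:
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[0].ports[0].containerPort}'

# Ports the app actually listens on inside the container:
kubectl exec -n <namespace> <pod-name> -- sh -c 'ss -tln || netstat -tln'
```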

2. Missing Dependencies

# Check application logs for import errors
kubectl logs -n <namespace> <pod-name>

# Rebuild image with correct requirements.txt
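A minimal rebuild-and-redeploy loop, reusing the image tag from the examples above (<app-deployment> is a hypothetical placeholder for your deployment name):

```shell
# Rebuild with the fixed dependency list, push, and restart the deployment.
docker build -t <registry-url>/web:main .
docker push <registry-url>/web:main
kubectl rollout restart deployment/<app-deployment> -n <namespace>
```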

3. Resource Limits

# Check if OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"

# If memory limit too low, increase:
resources:
  limits:
    memory: 512Mi  # Increase from 256Mi
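The limit can also be raised without editing the manifest, using kubectl set resources (this triggers a rolling restart of the deployment):

```shell
# Raise the container memory limit on the deployment in place.
kubectl set resources deployment/<name> -n <namespace> --limits=memory=512Mi
```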

CLI / Configuration Issues

CLI Sync Fails

Error: failed to sync configuration or authentication errors

Diagnosis:

# Test CLI connection
<cli-tool> ping \
  --backend <namespace> \
  --server https://<dashboard-url> \
  --token <TOKEN> \
  --tls-skip-verify

# Validate configuration file
<cli-tool> validate -f config.yaml

Solutions:

1. Invalid Token

# Generate new token in Dashboard
# User → API Tokens → Generate Token

# Test with new token
<cli-tool> sync -f config.yaml --server <URL> --token <NEW_TOKEN>

2. YAML Syntax Error

# Validate YAML
<cli-tool> validate -f config.yaml

# Or use yq/yamllint
yq eval '.' config.yaml

3. SSL Certificate Error

# Use --tls-skip-verify flag (for self-signed certs)
<cli-tool> sync -f config.yaml --server <URL> --token <TOKEN> --tls-skip-verify

Configuration Not Applied

Symptoms:

  • CLI sync succeeds but changes not visible
  • Routes not working as expected

Diagnosis:

# Dump current configuration
<cli-tool> dump --backend <namespace> --server <URL> --token <TOKEN> > current.yaml

# Compare with expected
diff config.yaml current.yaml

Solutions:

1. Routes Not Published

  • Synced routes are NOT active until published
  • Publish via Dashboard UI

2. Wrong Gateway Group

# Specify correct gateway group
<cli-tool> sync -f config.yaml --gateway-group default

3. Cache Issue

# Restart gateway pods to force reload
kubectl rollout restart deployment/<gateway-deployment> -n <namespace>

Useful Debugging Commands

Check All Resources

# All resources in namespace
kubectl get all -n <namespace>

# Wide output with more details
kubectl get all -n <namespace> -o wide

# All resource types including configmaps, secrets
kubectl get all,cm,secret,ingress,pvc -n <namespace>

Log Collection

# All gateway logs
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway --all-containers=true --tail=200

# Dashboard logs
kubectl logs -n <namespace> -l app=<namespace>3-dashboard --tail=100

# Stream logs in real-time
kubectl logs -n <namespace> -l app.kubernetes.io/name=gateway -f

Network Testing

# Test from within cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- sh

# Inside pod:
curl http://<gateway-deployment>-gateway.<namespace>.svc.cluster.local
curl http://web-service.<namespace>.svc.cluster.local

Performance Analysis

# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes

# Describe for resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
