diff DEBUGGING_GUIDE.md @ 5:b2ce158b4f22 draft

planemo upload commit ece227052d14d755b0d0b07a827152b2e98fb94b
author enis
date Thu, 24 Jul 2025 21:41:18 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/DEBUGGING_GUIDE.md	Thu Jul 24 21:41:18 2025 +0000
@@ -0,0 +1,303 @@
+# GCP Batch - Kubernetes Connectivity Debugging Guide
+
+## Analysis of Your Test Results
+
+Based on your Google DNS test output, here's what we learned:
+
+### ✅ What's Working
+- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
+- **Basic networking is operational**: The Batch worker has internet access
+- **DNS resolution works**: The container can resolve external addresses
+
+### ❌ What's Not Working
+- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
+- **Container tooling limited**: `ip` command not available in the container
+
+### 🔍 Key Insight
+This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when both are in the same VPC. This is **expected behavior** - ClusterIP services are not reachable from outside the cluster by default.
+
+## Immediate Action Required
+
+Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:
+
+### 🚀 Quick Fix: NodePort Service (Recommended for Testing)
+
+This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:
+
+```bash
+# First, find your current NFS service
+kubectl get svc | grep -i nfs
+
+# Create a NodePort service (replace with your actual NFS service details).
+# Note: `kubectl create service nodeport` sets the selector to `app: <name>`,
+# so the YAML variant below is usually safer for matching existing pods.
+kubectl create service nodeport nfs-ganesha-external \
+  --tcp=2049:2049 \
+  --node-port=32049
+
+# Or apply this YAML:
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Service
+metadata:
+  name: nfs-ganesha-external
+spec:
+  type: NodePort
+  ports:
+  - port: 2049
+    targetPort: 2049
+    nodePort: 32049
+  selector:
+    # Replace with your actual NFS pod labels
+    app: nfs-ganesha
+EOF
+```
+
+Then test with your tool using a GKE node IP and port 32049:
+
+```bash
+# Get a node IP
+kubectl get nodes -o wide
+
+# Test connectivity to <node-ip>:32049 (use the internal IP when the
+# Batch workers share the VPC with the cluster)
+nc -zv <node-ip> 32049
+```
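+
+If that test fails even though the service looks correct, GKE's default firewall rules may not allow traffic from Batch VMs to the NodePort. A minimal rule sketch (the source range and target tag are placeholders you must adapt to your VPC):
+
+```bash
+# Allow Batch worker VMs to reach the NodePort on GKE nodes.
+# Source range and target tag are placeholders - match them to your network.
+gcloud compute firewall-rules create allow-nodeport-from-batch \
+  --allow tcp:32049 \
+  --source-ranges 10.0.0.0/8 \
+  --target-tags gke-<cluster-name>-node
+```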
+
+### 🎯 Production Fix: LoadBalancer with Firewall Rules
+
+For production, use a LoadBalancer service with proper firewall configuration:
+
+```bash
+# Create LoadBalancer service
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Service
+metadata:
+  name: nfs-ganesha-lb
+spec:
+  type: LoadBalancer
+  ports:
+  - port: 2049
+    targetPort: 2049
+  selector:
+    # Replace with your actual NFS pod labels
+    app: nfs-ganesha
+EOF
+
+# Wait for external IP assignment
+kubectl get svc nfs-ganesha-lb -w
+
+# Create firewall rule allowing GCP Batch to access NFS
+# (narrow --source-ranges to your actual Batch subnet for production)
+gcloud compute firewall-rules create allow-nfs-from-batch \
+  --allow tcp:2049 \
+  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
+  --description "Allow NFS access from GCP Batch workers"
+```
+
+### 📋 Next Steps
+
+1. **Implement NodePort solution** for immediate testing
+2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
+3. **If NodePort works**, move to LoadBalancer for production use
+4. **Update Galaxy configuration** to use the new NFS endpoint (a standalone Batch-side mount check is sketched below)
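+
+To sanity-check the new endpoint independently of Galaxy, you can submit a minimal GCP Batch job that mounts it through Batch's built-in NFS volume support. A sketch, assuming a LoadBalancer IP serving NFS on the default port 2049 (the server IP, export path, and region are placeholders):
+
+```json
+{
+  "taskGroups": [{
+    "taskSpec": {
+      "volumes": [{
+        "nfs": {"server": "<nfs-lb-ip>", "remotePath": "/export"},
+        "mountPath": "/mnt/nfs"
+      }],
+      "runnables": [{"script": {"text": "ls -la /mnt/nfs && touch /mnt/nfs/batch-write-test"}}]
+    }
+  }],
+  "logsPolicy": {"destination": "CLOUD_LOGGING"}
+}
+```
+
+Submit with `gcloud batch jobs submit nfs-mount-test --location=<region> --config=job.json`; if the listing and write succeed, the Batch-to-NFS path works end to end.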
+
+### 💡 Why This Happens
+
+Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.
+
+## The Core Problem
+
+You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:
+
+1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
+2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default (demonstrated below)
+3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic
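+
+You can see point 2 directly: ClusterIPs are drawn from the cluster's service CIDR, which is only handled by kube-proxy on the nodes themselves, so VMs elsewhere in the VPC have no route to them. A quick comparison (cluster name and zone are placeholders):
+
+```bash
+# The ClusterIP of the API server service lives in the service CIDR
+kubectl get svc kubernetes -n default
+
+# Compare with the cluster's service range
+gcloud container clusters describe <cluster-name> --zone=<your-zone> \
+  --format="value(servicesIpv4Cidr)"
+```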
+
+## 🔍 Quick Diagnostic Commands
+
+Run these commands to understand your current setup before making changes:
+
+```bash
+# 1. Find your current NFS-related services and pods
+kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"
+
+# 2. Check what's actually running
+kubectl get pods -o wide | grep -i nfs
+
+# 3. Look at your current service configuration
+kubectl get svc -o wide | grep -i nfs
+
+# 4. Check if you have any existing LoadBalancer services
+kubectl get svc --field-selector spec.type=LoadBalancer
+
+# 5. Get node names plus internal and external IPs for NodePort testing
+#    (Batch workers in the same VPC should target the internal IP)
+kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $6 "\t" $7}'
+```
+
+## 🎯 Your Specific Issue Summary
+
+Based on your test output:
+- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
+- ❌ **Cannot reach Kubernetes services** (kubernetes.default.svc.cluster.local:443 failed)
+- 📍 **Root cause**: NFS service is likely ClusterIP type (internal only)
+- 🔧 **Solution**: Expose NFS externally via NodePort or LoadBalancer
+
+## Debugging Steps
+
+### 1. Test External Connectivity First
+Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
+```
+Test Type: Google DNS - External Test
+```
+This should succeed, confirming that basic GCP Batch networking works.
+
+### 2. Check Your NFS Service Type and Configuration
+
+Run these commands to examine your current NFS setup:
+
+```bash
+# Check NFS-related services
+kubectl get svc | grep -i nfs
+kubectl get svc | grep -i ganesha
+
+# Get detailed service info
+kubectl describe svc <your-nfs-service-name>
+
+# Check endpoints
+kubectl get endpoints | grep -i nfs
+```
+
+### 3. Common Solutions
+
+#### Option A: Use NodePort Service
+NodePort services are accessible from external networks:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: nfs-ganesha-nodeport
+spec:
+  type: NodePort
+  ports:
+  - port: 2049
+    targetPort: 2049
+    nodePort: 32049  # or let K8s assign
+  selector:
+    app: nfs-ganesha
+```
+
+Then test against a node IP and the NodePort (e.g., `<node-ip>:32049`); a mount-level check is sketched below.
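+
+A hypothetical mount check from any VM in the same VPC. This assumes the server speaks NFSv4, which needs only this single port - typically true for NFS-Ganesha (NFSv3 would also need the portmapper):
+
+```bash
+# Mount the share through the NodePort; <node-internal-ip> and the
+# export path are placeholders for your environment.
+sudo mkdir -p /mnt/nfs-test
+sudo mount -t nfs -o vers=4,port=32049 <node-internal-ip>:/ /mnt/nfs-test
+ls -la /mnt/nfs-test
+```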
+
+#### Option B: LoadBalancer with Correct Firewall Rules
+Ensure your LoadBalancer service has proper firewall rules:
+
+```bash
+# Check your LoadBalancer service
+kubectl get svc <nfs-service-name> -o yaml
+
+# Create firewall rule if needed
+gcloud compute firewall-rules create allow-nfs-from-batch \
+  --allow tcp:2049 \
+  --source-ranges 10.0.0.0/8 \
+  --target-tags gke-<cluster-name>-node
+```
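+
+Since the Batch workers live in the same VPC, an internal LoadBalancer keeps NFS off the public internet entirely. A sketch using the GKE internal-LB annotation (the pod labels and the allowed CIDR are placeholders):
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: nfs-ganesha-ilb
+  annotations:
+    networking.gke.io/load-balancer-type: "Internal"
+spec:
+  type: LoadBalancer
+  loadBalancerSourceRanges:
+  - 10.0.0.0/8   # placeholder: narrow this to your Batch subnet
+  ports:
+  - port: 2049
+    targetPort: 2049
+  selector:
+    app: nfs-ganesha   # placeholder: match your NFS pod labels
+```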
+
+#### Option C: Use Cloud Filestore
+For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:
+
+```bash
+# Create Filestore instance
+gcloud filestore instances create galaxy-filestore \
+  --tier=STANDARD \
+  --file-share=name="galaxy",capacity=1TB \
+  --network=name="<your-vpc>" \
+  --zone=<your-zone>
+```
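+
+Galaxy pods would then reach the Filestore share over plain NFS. A minimal PersistentVolume sketch (the server IP comes from the created instance; capacity and paths are placeholders):
+
+```yaml
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: galaxy-filestore-pv
+spec:
+  capacity:
+    storage: 1Ti
+  accessModes:
+  - ReadWriteMany
+  persistentVolumeReclaimPolicy: Retain
+  nfs:
+    # Look up the IP with:
+    #   gcloud filestore instances describe galaxy-filestore --zone=<your-zone>
+    server: <filestore-ip>
+    path: /galaxy
+```
+
+GCP Batch jobs can mount the same share directly through their `nfs` volume type, so Galaxy and the Batch workers see identical storage.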
+
+### 4. Network Debugging Commands
+
+Run these from a machine with kubectl access to understand the network setup:
+
+```bash
+# Get node info
+kubectl get nodes -o wide
+
+# Check what's running on nodes
+kubectl get pods -o wide | grep nfs
+
+# Test from a pod inside the cluster
+kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
+# Then inside the pod:
+nc -zv <nfs-service-ip> 2049
+```
+
+### 5. Advanced Debugging with Enhanced Tool
+
+Use the enhanced tool I created to test different scenarios:
+
+1. **Test Galaxy Web Service**: `test_type=galaxy_web`
+   - This will try to find and test your Galaxy web service
+   - If this fails too, it's a broader networking issue
+
+2. **Test Custom Endpoints**: `test_type=custom`
+   - Test specific IPs you know should work
+   - Try testing a GKE node IP directly
+
+3. **Check Kubernetes DNS**: `test_type=k8s_dns`
+   - This tests if Batch workers can reach Kubernetes cluster services
+
+## 🛠️ Enhanced Container Tools
+
+The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:
+
+### Core Network Tools
+- `ip` - Advanced IP routing and network device configuration
+- `ping` - Basic connectivity testing
+- `nslookup`/`dig` - DNS resolution testing
+- `curl`/`wget` - HTTP/HTTPS testing
+- `telnet` - Port connectivity testing
+- `traceroute` - Network path tracing
+- `netstat` - Network connection status
+- `ss` - Socket statistics
+- `tcpdump` - Network packet capture
+- `nmap` - Network scanning and port discovery
+
+### Enhanced Test Script
+
+With these tools, the container can now provide much more detailed debugging information:
+
+```bash
+# Network interface details
+ip addr show
+ip route show
+
+# DNS resolution testing
+nslookup target-host
+dig target-host
+
+# Port scanning
+nmap -p 2049 target-host
+
+# HTTP/HTTPS testing (for web services)
+curl -v http://target-host:port
+
+# Network path tracing
+traceroute target-host
+```
+
+## Root Cause Analysis
+
+Based on your description, the most likely issues are:
+
+1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
+2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
+3. **Network policies** in your cluster might be blocking external traffic (an allow-rule sketch follows this list)
+4. **GKE cluster** might be using a different subnet than GCP Batch workers
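+
+For point 3, if your cluster enforces network policies, the NFS pods need an explicit ingress allowance for the Batch worker range. A sketch, assuming the pods carry an `app: nfs-ganesha` label and a placeholder CIDR:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-nfs-from-batch
+spec:
+  podSelector:
+    matchLabels:
+      app: nfs-ganesha   # placeholder: match your NFS pod labels
+  policyTypes:
+  - Ingress
+  ingress:
+  - from:
+    - ipBlock:
+        cidr: 10.0.0.0/8   # placeholder: your Batch worker subnet
+    ports:
+    - protocol: TCP
+      port: 2049
+```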
+
+## Recommended Solution
+
+For Galaxy on GKE with GCP Batch integration, I recommend:
+
+1. **Use Google Cloud Filestore** for shared storage (most reliable)
+2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
+3. **Test with the enhanced debugging tool** to get detailed network information
+
+Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?