# GCP Batch - Kubernetes Connectivity Debugging Guide

## Analysis of Your Test Results

Based on your Google DNS test output, here's what we learned:

### ✅ What's Working
- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
- **Basic networking is operational**: The Batch worker has internet access
- **DNS resolution works**: The container can resolve external addresses

### ❌ What's Not Working
- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
- **Container tooling limited**: `ip` command not available in the container

### 🔍 Key Insight
This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.

## Immediate Action Required

Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:

### 🚀 Quick Fix: NodePort Service (Recommended for Testing)

This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:

```bash
# First, find your current NFS service
kubectl get svc | grep -i nfs

# Create a NodePort service (replace with your actual NFS service details)
kubectl create service nodeport nfs-ganesha-external \
  --tcp=2049:2049 \
  --node-port=32049

# Or apply this YAML:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-external
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF
```

Then test with your tool using a GKE node IP and port 32049:

```bash
# Get a node IP
kubectl get nodes -o wide

# Test connectivity to <node-ip>:32049
```

### 🎯 Production Fix: LoadBalancer with Firewall Rules

For production, use a LoadBalancer service with proper firewall configuration:

```bash
# Create LoadBalancer service
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-lb
spec:
  type: LoadBalancer
  ports:
  - port: 2049
    targetPort: 2049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF

# Wait for external IP assignment
kubectl get svc nfs-ganesha-lb -w

# Create firewall rule allowing GCP Batch to access NFS
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
  --description "Allow NFS access from GCP Batch workers"
```

### 📋 Next Steps

1. **Implement the NodePort solution** for immediate testing
2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
3. **If NodePort works**, move to LoadBalancer for production use
4. **Update the Galaxy configuration** to use the new NFS endpoint

### 💡 Why This Happens

Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.
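To sanity-check the NodePort route before wiring it into Galaxy, you can probe the nodes directly from any VM in the same VPC (which is effectively what a GCP Batch worker is). This is only a minimal sketch: it assumes the `nfs-ganesha-external` service and port 32049 from the example above, that `kubectl` is configured wherever you collect the node IPs, and that `nc` is available on the VM doing the probing.

```bash
# Collect the internal IPs of all GKE nodes
# (run this where kubectl is configured; copy the IPs to the test VM if needed)
NODE_IPS=$(kubectl get nodes \
  -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')

# Probe the NodePort on each node; a single success is enough,
# because a NodePort service listens on every node in the cluster
for ip in $NODE_IPS; do
  if nc -z -w 5 "$ip" 32049; then
    echo "OK:   $ip:32049 reachable"
  else
    echo "FAIL: $ip:32049 unreachable"
  fi
done
```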
## The Core Problem

You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:

1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic

## 🔍 Quick Diagnostic Commands

Run these commands to understand your current setup before making changes:

```bash
# 1. Find your current NFS-related services and pods
kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"

# 2. Check what's actually running
kubectl get pods -o wide | grep -i nfs

# 3. Look at your current service configuration
kubectl get svc -o wide | grep -i nfs

# 4. Check if you have any existing LoadBalancer services
kubectl get svc --field-selector spec.type=LoadBalancer

# 5. Get node IPs for potential NodePort testing
kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $7}'
```

## 🎯 Your Specific Issue Summary

Based on your test output:
- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
- ❌ **Cannot reach Kubernetes services** (`kubernetes.default.svc.cluster.local:443` failed)
- 📍 **Root cause**: the NFS service is likely a ClusterIP type (internal only)
- 🔧 **Solution**: expose NFS externally via NodePort or LoadBalancer

## Debugging Steps

### 1. Test External Connectivity First

Use the enhanced tool with `test_type=google_dns` to verify that basic connectivity works:

```
Test Type: Google DNS - External Test
```

This should succeed, confirming that GCP Batch networking is working.

### 2. Check Your NFS Service Type and Configuration

Run these commands to examine your current NFS setup:

```bash
# Check NFS-related services
kubectl get svc | grep -i nfs
kubectl get svc | grep -i ganesha

# Get detailed service info
kubectl describe svc <your-nfs-service-name>

# Check endpoints
kubectl get endpoints | grep -i nfs
```

### 3. Common Solutions

#### Option A: Use a NodePort Service

NodePort services are accessible from external networks:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-nodeport
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049  # or let K8s assign one
  selector:
    app: nfs-ganesha
```

Then test with the node IP and port (e.g., `<node-ip>:32049`).

#### Option B: LoadBalancer with Correct Firewall Rules

Ensure your LoadBalancer service has proper firewall rules:

```bash
# Check your LoadBalancer service
kubectl get svc <nfs-service-name> -o yaml

# Create firewall rule if needed
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8 \
  --target-tags gke-<cluster-name>-node
```

#### Option C: Use Cloud Filestore

For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:

```bash
# Create Filestore instance
gcloud filestore instances create galaxy-filestore \
  --tier=STANDARD \
  --file-share=name="galaxy",capacity=1TB \
  --network=name="<your-vpc>" \
  --zone=<your-zone>
```
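Once the Filestore instance is up, clients only need its IP address and share name. The following is a minimal sketch, assuming the `galaxy-filestore` instance and `galaxy` share from the command above; the zone placeholder, mount point, and the `--format` projection are assumptions to adapt to your setup.

```bash
# Look up the IP address assigned to the Filestore instance
# (assumed field path; adjust --zone to match the create command above)
FILESTORE_IP=$(gcloud filestore instances describe galaxy-filestore \
  --zone=<your-zone> \
  --format='value(networks[0].ipAddresses[0])')

# Filestore serves NFS on 2049, so the same connectivity test applies
nc -z -w 5 "$FILESTORE_IP" 2049 && echo "Filestore reachable"

# Mount the "galaxy" share created with --file-share above (from any VM in the VPC)
sudo mkdir -p /mnt/galaxy
sudo mount -t nfs "${FILESTORE_IP}:/galaxy" /mnt/galaxy
```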
### 4. Network Debugging Commands

Run these on a GKE node to understand the network setup:

```bash
# Get node info
kubectl get nodes -o wide

# Check what's running on nodes
kubectl get pods -o wide | grep nfs

# Test from a pod inside the cluster
kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then inside the pod:
nc -zv <nfs-service-ip> 2049
```

### 5. Advanced Debugging with the Enhanced Tool

Use the enhanced tool I created to test different scenarios:

1. **Test Galaxy Web Service**: `test_type=galaxy_web`
   - This will try to find and test your Galaxy web service
   - If this fails too, it's a broader networking issue

2. **Test Custom Endpoints**: `test_type=custom`
   - Test specific IPs you know should work
   - Try testing a GKE node IP directly

3. **Check Kubernetes DNS**: `test_type=k8s_dns`
   - This tests if Batch workers can reach Kubernetes cluster services

## 🛠️ Enhanced Container Tools

The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:

### Core Network Tools
- `ip` - Advanced IP routing and network device configuration
- `ping` - Basic connectivity testing
- `nslookup`/`dig` - DNS resolution testing
- `curl`/`wget` - HTTP/HTTPS testing
- `telnet` - Port connectivity testing
- `traceroute` - Network path tracing
- `netstat` - Network connection status
- `ss` - Socket statistics
- `tcpdump` - Network packet capture
- `nmap` - Network scanning and port discovery

### Enhanced Test Script

With these tools, the container can now provide much more detailed debugging information:

```bash
# Network interface details
ip addr show
ip route show

# DNS resolution testing
nslookup target-host
dig target-host

# Port scanning
nmap -p 2049 target-host

# HTTP/HTTPS testing (for web services)
curl -v http://target-host:port

# Network path tracing
traceroute target-host
```

## Root Cause Analysis

Based on your description, the most likely issues are:

1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
3. **Network policies** in your cluster might be blocking external traffic
4. **GKE cluster** might be using a different subnet than the GCP Batch workers

## Recommended Solution

For Galaxy on GKE with GCP Batch integration, I recommend:

1. **Use Google Cloud Filestore** for shared storage (most reliable)
2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
3. **Test with the enhanced debugging tool** to get detailed network information

Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?
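As a quick way to iterate on any of the options above without a full Galaxy-to-Batch round trip, the debugging image can also be run by hand from a Compute Engine VM in the same VPC. This is a hedged sketch, not the tool's documented invocation: it assumes the image's entrypoint allows an arbitrary command and that `nc` and `nmap` are on its PATH (per the tool list above); the target IP is a placeholder.

```bash
# Placeholder: replace with your node IP (NodePort), LoadBalancer IP, or Filestore IP
TARGET_IP="203.0.113.10"

# Port check and scan using the tools bundled in the debugging image
# (assumes the image lets you override its default command)
docker run --rm afgane/gcp-batch-netcat:0.2.0 nc -zv -w 5 "$TARGET_IP" 2049
docker run --rm afgane/gcp-batch-netcat:0.2.0 nmap -Pn -p 2049 "$TARGET_IP"
```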