# GCP Batch - Kubernetes Connectivity Debugging Guide

## Analysis of Your Test Results

Based on your Google DNS test output, here's what we learned:

### ✅ What's Working
- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
- **Basic networking is operational**: The Batch worker has internet access
- **DNS resolution works**: The container can resolve external addresses

### ❌ What's Not Working
- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
- **Container tooling limited**: the `ip` command is not available in the container

### 🔍 Key Insight
This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.

## Immediate Action Required

Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:

### 🚀 Quick Fix: NodePort Service (Recommended for Testing)

This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:

```bash
# First, find your current NFS service
kubectl get svc | grep -i nfs

# Create a NodePort service (replace with your actual NFS service details)
kubectl create service nodeport nfs-ganesha-external \
  --tcp=2049:2049 \
  --node-port=32049

# Or apply this YAML:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-external
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF
```

Then test with your tool using a GKE node IP and port 32049:

```bash
# Get a node IP
kubectl get nodes -o wide

# Test connectivity to <node-ip>:32049
```

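If you want a quick manual check before re-running the debugging tool, a probe of the NodePort from any VM in the same VPC might look like this (the IP below is a placeholder for a real node internal IP from the command above):

```bash
# Probe the NFS NodePort from a VM in the same VPC
# (replace 10.128.0.5 with an actual node INTERNAL-IP)
nc -zv -w 5 10.128.0.5 32049
```
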
### 🎯 Production Fix: LoadBalancer with Firewall Rules

For production, use a LoadBalancer service with proper firewall configuration:

```bash
# Create LoadBalancer service
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-lb
spec:
  type: LoadBalancer
  ports:
  - port: 2049
    targetPort: 2049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF

# Wait for external IP assignment
kubectl get svc nfs-ganesha-lb -w

# Create firewall rule allowing GCP Batch to access NFS
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
  --description "Allow NFS access from GCP Batch workers"
```

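Once the external IP is assigned, a quick sanity check along these lines confirms the port is reachable from outside the cluster (a sketch; adjust the service name if yours differs):

```bash
# Grab the LoadBalancer's external IP once it has been provisioned
NFS_LB_IP=$(kubectl get svc nfs-ganesha-lb \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Verify the NFS port is reachable
nc -zv -w 5 "$NFS_LB_IP" 2049
```
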
### 📋 Next Steps

1. **Implement NodePort solution** for immediate testing
2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
3. **If NodePort works**, move to LoadBalancer for production use
4. **Update Galaxy configuration** to use the new NFS endpoint

### 💡 Why This Happens

Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.

## The Core Problem

You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:

1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic (see the check sketched below)

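If you suspect the firewall-rule scenario, listing the project's rules and looking for anything that covers the NFS port is a quick way to confirm (the rule name in the `describe` call is just an example):

```bash
# Look for firewall rules that mention the NFS port or an NFS-related name
gcloud compute firewall-rules list | grep -i -E "2049|nfs"

# Inspect a specific rule in detail (replace with the actual rule name)
gcloud compute firewall-rules describe allow-nfs-from-batch
```
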
## 🔍 Quick Diagnostic Commands

Run these commands to understand your current setup before making changes:

```bash
# 1. Find your current NFS-related services and pods
kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"

# 2. Check what's actually running
kubectl get pods -o wide | grep -i nfs

# 3. Look at your current service configuration
kubectl get svc -o wide | grep -i nfs

# 4. Check if you have any existing LoadBalancer services
kubectl get svc --field-selector spec.type=LoadBalancer

# 5. Get node names and internal/external IPs for potential NodePort testing
kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $6 "\t" $7}'
```

## 🎯 Your Specific Issue Summary

Based on your test output:
- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
- ❌ **Cannot reach Kubernetes services** (`kubernetes.default.svc.cluster.local:443` failed)
- 📍 **Root cause**: the NFS service is likely ClusterIP type (internal only)
- 🔧 **Solution**: expose NFS externally via NodePort or LoadBalancer

## Debugging Steps

### 1. Test External Connectivity First
Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
```
Test Type: Google DNS - External Test
```
This should succeed and confirms that GCP Batch networking is working.

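For reference, the manual equivalent of this test from any shell with netcat available is roughly the following (a sketch, not necessarily the tool's exact invocation):

```bash
# TCP check against Google's public DNS server, 5-second timeout
nc -zv -w 5 8.8.8.8 53
```
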
### 2. Check Your NFS Service Type and Configuration

Run these commands to examine your current NFS setup:

```bash
# Check NFS-related services
kubectl get svc | grep -i nfs
kubectl get svc | grep -i ganesha

# Get detailed service info
kubectl describe svc <your-nfs-service-name>

# Check endpoints
kubectl get endpoints | grep -i nfs
```

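To pull out just the fields that matter here (service type and cluster IP), a jsonpath query like this works; the service name is a placeholder:

```bash
# Print only the service type and ClusterIP (replace the service name)
kubectl get svc <your-nfs-service-name> \
  -o jsonpath='{.spec.type}{"\t"}{.spec.clusterIP}{"\n"}'
```
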
### 3. Common Solutions

#### Option A: Use NodePort Service
NodePort services are accessible from external networks:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-nodeport
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049  # or let K8s assign
  selector:
    app: nfs-ganesha
```

Then test with the node IP:port (e.g., `<node-ip>:32049`).

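To verify that NFS actually serves data over the NodePort (not just that the port answers), an end-to-end mount test from a VM in the same VPC might look like the sketch below; the node IP and the `/export` path are assumptions, so substitute your node's internal IP and your Ganesha export path:

```bash
# Mount the export exposed via the NodePort (NFSv4 over a non-standard port)
# 10.128.0.5 and /export are placeholders - use a real node IP and export path
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs -o vers=4,port=32049 10.128.0.5:/export /mnt/nfs-test

# Confirm the export is readable, then clean up
ls -l /mnt/nfs-test
sudo umount /mnt/nfs-test
```
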
#### Option B: LoadBalancer with Correct Firewall Rules
Ensure your LoadBalancer service has proper firewall rules:

```bash
# Check your LoadBalancer service
kubectl get svc <nfs-service-name> -o yaml

# Create firewall rule if needed
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8 \
  --target-tags gke-<cluster-name>-node
```

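Note that the `--target-tags` value must match the network tag actually applied to your GKE nodes, which usually includes a generated suffix; one way to look it up (node name and zone are placeholders):

```bash
# Show the network tags on one of the GKE node VMs
gcloud compute instances describe <gke-node-instance-name> \
  --zone <your-zone> \
  --format="value(tags.items)"
```
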
#### Option C: Use Cloud Filestore
For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:

```bash
# Create Filestore instance
gcloud filestore instances create galaxy-filestore \
  --tier=STANDARD \
  --file-share=name="galaxy",capacity=1TB \
  --network=name="<your-vpc>" \
  --zone=<your-zone>
```

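Because Filestore is just NFS served from a VPC-internal IP, both the GKE cluster and the GCP Batch workers can mount it directly. Inside the cluster you would typically point a PersistentVolume at it; the IP, share name, and capacity below are placeholders:

```bash
# PersistentVolume backed by the Filestore instance (values are placeholders)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: galaxy-filestore-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: <filestore-ip>   # from: gcloud filestore instances describe galaxy-filestore
    path: /galaxy            # the file-share name created above
EOF
```
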
### 4. Network Debugging Commands

Run these on a GKE node to understand the network setup:

```bash
# Get node info
kubectl get nodes -o wide

# Check what's running on nodes
kubectl get pods -o wide | grep nfs

# Test from a pod inside the cluster
kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then inside the pod:
nc -zv <nfs-service-ip> 2049
```

### 5. Advanced Debugging with Enhanced Tool

Use the enhanced tool I created to test different scenarios:

1. **Test Galaxy Web Service**: `test_type=galaxy_web`
   - This will try to find and test your Galaxy web service
   - If this fails too, it's a broader networking issue

2. **Test Custom Endpoints**: `test_type=custom`
   - Test specific IPs you know should work
   - Try testing a GKE node IP directly

3. **Check Kubernetes DNS**: `test_type=k8s_dns`
   - This tests if Batch workers can reach Kubernetes cluster services

## 🛠️ Enhanced Container Tools

The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:

### Core Network Tools
- `ip` - Advanced IP routing and network device configuration
- `ping` - Basic connectivity testing
- `nslookup`/`dig` - DNS resolution testing
- `curl`/`wget` - HTTP/HTTPS testing
- `telnet` - Port connectivity testing
- `traceroute` - Network path tracing
- `netstat` - Network connection status
- `ss` - Socket statistics
- `tcpdump` - Network packet capture
- `nmap` - Network scanning and port discovery

### Enhanced Test Script

With these tools, the container can now provide much more detailed debugging information:

```bash
# Network interface details
ip addr show
ip route show

# DNS resolution testing
nslookup target-host
dig target-host

# Port scanning
nmap -p 2049 target-host

# HTTP/HTTPS testing (for web services)
curl -v http://target-host:port

# Network path tracing
traceroute target-host
```

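For reference, if you ever need to rebuild or extend the image, these tools map onto standard packages (assuming a Debian/Ubuntu base); the snippet below is only an illustrative sketch, not the actual Dockerfile of `afgane/gcp-batch-netcat:0.2.0`:

```bash
# Illustrative only - approximate package set for a Debian/Ubuntu-based debug image
apt-get update && apt-get install -y --no-install-recommends \
    iproute2 iputils-ping dnsutils curl wget telnet traceroute \
    net-tools tcpdump nmap netcat-openbsd
```
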
## Root Cause Analysis

Based on your description, the most likely issues are:

1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
3. **Network policies** in your cluster might be blocking external traffic (see the checks sketched below)
4. **GKE cluster** might be using a different subnet than GCP Batch workers

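Points 3 and 4 are easy to rule in or out with a few read-only commands; the cluster name, region, and network below are placeholders:

```bash
# Any NetworkPolicy objects that could restrict ingress to the NFS pods?
kubectl get networkpolicy --all-namespaces

# Which network and subnet does the GKE cluster use?
gcloud container clusters describe <cluster-name> \
  --region <region> \
  --format="value(network,subnetwork)"

# List the subnets in the VPC used by the Batch jobs for comparison
gcloud compute networks subnets list --network <your-vpc>
```
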
## Recommended Solution

For Galaxy on GKE with GCP Batch integration, I recommend:

1. **Use Google Cloud Filestore** for shared storage (most reliable)
2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
3. **Test with the enhanced debugging tool** to get detailed network information

Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?