# GCP Batch - Kubernetes Connectivity Debugging Guide

## Analysis of Your Test Results

Based on your Google DNS test output, here's what we learned:

### ✅ What's Working

- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
- **Basic networking is operational**: The Batch worker has internet access
- **DNS resolution works**: The container can resolve external addresses

### ❌ What's Not Working

- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
- **Container tooling limited**: `ip` command not available in the container

### 🔍 Key Insight

This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.

## Immediate Action Required

Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:

### 🚀 Quick Fix: NodePort Service (Recommended for Testing)

This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:

```bash
# First, find your current NFS service
kubectl get svc | grep -i nfs

# Create a NodePort service (replace with your actual NFS service details)
kubectl create service nodeport nfs-ganesha-external \
  --tcp=2049:2049 \
  --node-port=32049

# Or apply this YAML:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-external
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF
```

Then test with your tool using a GKE node IP and port 32049:

```bash
# Get a node IP
kubectl get nodes -o wide

# Test connectivity to <node-ip>:32049
```

### 🎯 Production Fix: LoadBalancer with Firewall Rules

For production, use a LoadBalancer service with proper firewall configuration:

```bash
# Create LoadBalancer service
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-lb
spec:
  type: LoadBalancer
  ports:
  - port: 2049
    targetPort: 2049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF

# Wait for external IP assignment
kubectl get svc nfs-ganesha-lb -w

# Create firewall rule allowing GCP Batch to access NFS
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
  --description "Allow NFS access from GCP Batch workers"
```

### 📋 Next Steps

1. **Implement NodePort solution** for immediate testing
2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
3. **If NodePort works**, move to LoadBalancer for production use
4. **Update Galaxy configuration** to use the new NFS endpoint

### 💡 Why This Happens

Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.

## The Core Problem

You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:

1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default (a quick check of this is sketched just below)
3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic
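To see point 2 for yourself, a quick check like the following can be run from any Compute Engine VM in the same VPC but outside the cluster (a stand-in for a Batch worker). The service name `nfs-ganesha` is just the placeholder used in the examples above; substitute your actual service:

```bash
# On a workstation with cluster access: look up the NFS service's ClusterIP
kubectl get svc nfs-ganesha -o jsonpath='{.spec.clusterIP}{"\n"}'

# On a Compute Engine VM in the same VPC (outside the GKE cluster):
# this is expected to time out, because ClusterIPs are only routable
# inside the cluster's pod/service network.
nc -zv -w 5 <cluster-ip> 2049
```

If the same check against a node IP and NodePort (or a LoadBalancer IP) succeeds, the cluster side is fine and only the exposure method needs to change.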
## 🔍 Quick Diagnostic Commands

Run these commands to understand your current setup before making changes:

```bash
# 1. Find your current NFS-related services and pods
kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"

# 2. Check what's actually running
kubectl get pods -o wide | grep -i nfs

# 3. Look at your current service configuration
kubectl get svc -o wide | grep -i nfs

# 4. Check if you have any existing LoadBalancer services
kubectl get svc --field-selector spec.type=LoadBalancer

# 5. Get node IPs for potential NodePort testing
kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $7}'
```

## 🎯 Your Specific Issue Summary

Based on your test output:

- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
- ❌ **Cannot reach Kubernetes services** (kubernetes.default.svc.cluster.local:443 failed)
- 📍 **Root cause**: NFS service is likely ClusterIP type (internal only)
- 🔧 **Solution**: Expose NFS externally via NodePort or LoadBalancer

## Debugging Steps

### 1. Test External Connectivity First

Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:

```
Test Type: Google DNS - External Test
```

This should succeed, confirming that GCP Batch networking is working.

### 2. Check Your NFS Service Type and Configuration

Run these commands to examine your current NFS setup:

```bash
# Check NFS-related services
kubectl get svc | grep -i nfs
kubectl get svc | grep -i ganesha

# Get detailed service info
kubectl describe svc <your-nfs-service-name>

# Check endpoints
kubectl get endpoints | grep -i nfs
```

### 3. Common Solutions

#### Option A: Use NodePort Service

NodePort services are accessible from external networks:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-nodeport
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049  # or let K8s assign
  selector:
    app: nfs-ganesha
```

Then test with the node IP and port (e.g., `<node-ip>:32049`).

#### Option B: LoadBalancer with Correct Firewall Rules

Ensure your LoadBalancer service has proper firewall rules:

```bash
# Check your LoadBalancer service
kubectl get svc <nfs-service-name> -o yaml

# Create firewall rule if needed
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8 \
  --target-tags gke-<cluster-name>-node
```

#### Option C: Use Cloud Filestore

For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:

```bash
# Create Filestore instance
gcloud filestore instances create galaxy-filestore \
  --tier=STANDARD \
  --file-share=name="galaxy",capacity=1TB \
  --network=name="<your-vpc>" \
  --zone=<your-zone>
```

### 4. Network Debugging Commands

Run these commands to understand the network setup (the final check runs from a pod inside the cluster):

```bash
# Get node info
kubectl get nodes -o wide

# Check what's running on nodes
kubectl get pods -o wide | grep nfs

# Test from a pod inside the cluster
kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then inside the pod:
nc -zv <nfs-service-ip> 2049
```

### 5. Advanced Debugging with Enhanced Tool

Use the enhanced tool I created to test different scenarios:

1. **Test Galaxy Web Service**: `test_type=galaxy_web`
   - This will try to find and test your Galaxy web service
   - If this fails too, it's a broader networking issue
2. **Test Custom Endpoints**: `test_type=custom`
   - Test specific IPs you know should work
   - Try testing a GKE node IP directly (see the helper sketch after this list)
3. **Check Kubernetes DNS**: `test_type=k8s_dns`
   - This tests if Batch workers can reach Kubernetes cluster services
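For the custom test, a small helper like this collects the values to plug in. It assumes the `nfs-ganesha-nodeport` service from Option A; adjust the names to match your cluster:

```bash
# Internal IP of the first node - reachable from Batch workers in the same VPC
NODE_IP=$(kubectl get nodes \
  -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

# NodePort the NFS service is exposed on (32049 in the Option A example)
NODE_PORT=$(kubectl get svc nfs-ganesha-nodeport \
  -o jsonpath='{.spec.ports[0].nodePort}')

echo "Run the tool with test_type=custom against ${NODE_IP}:${NODE_PORT}"
```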
## 🛠️ Enhanced Container Tools

The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:

### Core Network Tools

- `ip` - Advanced IP routing and network device configuration
- `ping` - Basic connectivity testing
- `nslookup`/`dig` - DNS resolution testing
- `curl`/`wget` - HTTP/HTTPS testing
- `telnet` - Port connectivity testing
- `traceroute` - Network path tracing
- `netstat` - Network connection status
- `ss` - Socket statistics
- `tcpdump` - Network packet capture
- `nmap` - Network scanning and port discovery

### Enhanced Test Script

With these tools, the container can now provide much more detailed debugging information:

```bash
# Network interface details
ip addr show
ip route show

# DNS resolution testing
nslookup target-host
dig target-host

# Port scanning
nmap -p 2049 target-host

# HTTP/HTTPS testing (for web services)
curl -v http://target-host:port

# Network path tracing
traceroute target-host
```

## Root Cause Analysis

Based on your description, the most likely issues are:

1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
3. **Network policies** in your cluster might be blocking external traffic (checks for this and the firewall rules are sketched at the end of this guide)
4. **GKE cluster** might be using a different subnet than GCP Batch workers

## Recommended Solution

For Galaxy on GKE with GCP Batch integration, I recommend:

1. **Use Google Cloud Filestore** for shared storage (most reliable)
2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
3. **Test with the enhanced debugging tool** to get detailed network information

Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?
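For reference, here is a minimal sketch for checking the two most common blockers from the Root Cause Analysis above (firewall rules and network policies). The VPC name is a placeholder, and the output needs to be read against the source ranges your Batch workers use:

```bash
# List VPC firewall rules; confirm TCP 2049 is allowed from the subnet the
# Batch workers run in (add --filter="network:<your-vpc>" to narrow the list).
gcloud compute firewall-rules list

# List NetworkPolicies in all namespaces; a restrictive policy around the
# NFS pods could drop traffic that originates outside the cluster.
kubectl get networkpolicy --all-namespaces
```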