# GCP Batch - Kubernetes Connectivity Debugging Guide

## Analysis of Your Test Results

Based on your Google DNS test output, here's what we learned:

### ✅ What's Working
- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
- **Basic networking is operational**: The Batch worker has internet access
- **DNS resolution works**: The container can resolve external addresses

### ❌ What's Not Working
- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
- **Container tooling limited**: `ip` command not available in the container

### 🔍 Key Insight
This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.

## Immediate Action Required

Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:

### 🚀 Quick Fix: NodePort Service (Recommended for Testing)

This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:

```bash
# First, find your current NFS service
kubectl get svc | grep -i nfs

# Create a NodePort service (replace with your actual NFS service details)
# Note: `kubectl create service nodeport` sets the selector to app=nfs-ganesha-external,
# so prefer the YAML below if your NFS pods use different labels
kubectl create service nodeport nfs-ganesha-external \
  --tcp=2049:2049 \
  --node-port=32049

# Or apply this YAML:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-external
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF
```

Then test with your tool using a GKE node IP and port 32049:

```bash
# Get a node IP
kubectl get nodes -o wide

# Test connectivity to <node-ip>:32049
```

### 🎯 Production Fix: LoadBalancer with Firewall Rules

For production, use a LoadBalancer service with proper firewall configuration:

```bash
# Create LoadBalancer service
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-lb
spec:
  type: LoadBalancer
  ports:
  - port: 2049
    targetPort: 2049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF

# Wait for external IP assignment
kubectl get svc nfs-ganesha-lb -w

# Create firewall rule allowing GCP Batch to access NFS
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
  --description "Allow NFS access from GCP Batch workers"
```

### 📋 Next Steps

1. **Implement NodePort solution** for immediate testing
2. **Test connectivity** using your enhanced debugging tool with `test_type=custom` (the snippet after this list shows how to pick the endpoint)
3. **If NodePort works**, move to LoadBalancer for production use
4. **Update Galaxy configuration** to use the new NFS endpoint

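For step 2, the custom test needs a host and port to probe. A minimal sketch for picking a node's internal IP and pairing it with the NodePort created above - the `[0]` index (first node) and port 32049 are assumptions to adjust:

```bash
# Grab the internal IP of the first node (any node works for a NodePort service)
NODE_IP=$(kubectl get nodes \
  -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')

# Use this as the custom endpoint in the debugging tool
echo "Point test_type=custom at ${NODE_IP}:32049"
```
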
### 💡 Why This Happens

Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.

## The Core Problem

You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:

1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic (see the firewall check after this list)

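To rule out point 3, a quick look at the project's firewall rules shows whether anything already allows the NFS port; this is a plain `gcloud` listing, with grep as a rough filter rather than an exact query:

```bash
# Look for an existing rule that allows TCP 2049 (the ALLOW column lists ports)
gcloud compute firewall-rules list | grep 2049

# Check whether any LoadBalancer service actually received an external IP
kubectl get svc --field-selector spec.type=LoadBalancer -o wide
```
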
## 🔍 Quick Diagnostic Commands

Run these commands to understand your current setup before making changes:

```bash
# 1. Find your current NFS-related services and pods
kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"

# 2. Check what's actually running
kubectl get pods -o wide | grep -i nfs

# 3. Look at your current service configuration
kubectl get svc -o wide | grep -i nfs

# 4. Check if you have any existing LoadBalancer services
kubectl get svc --field-selector spec.type=LoadBalancer

# 5. Get node names with internal and external IPs for potential NodePort testing
kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $6 "\t" $7}'
```

## 🎯 Your Specific Issue Summary

Based on your test output:
- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
- ❌ **Cannot reach Kubernetes services** (kubernetes.default.svc.cluster.local:443 failed)
- 📍 **Root cause**: NFS service is likely ClusterIP type (internal only) - confirm with the check below
- 🔧 **Solution**: Expose NFS externally via NodePort or LoadBalancer

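To confirm the root-cause assumption, printing the service type is enough; the service name here is a placeholder for whatever the diagnostic commands above returned:

```bash
# Prints ClusterIP, NodePort, or LoadBalancer
kubectl get svc <your-nfs-service-name> -o jsonpath='{.spec.type}{"\n"}'
```
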
## Debugging Steps

### 1. Test External Connectivity First
Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
```
Test Type: Google DNS - External Test
```
This should succeed, confirming that basic GCP Batch networking is working.

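The same check can be reproduced by hand from any VM or Cloud Shell session; this assumes plain netcat is available and mirrors what the tool reports, nothing more:

```bash
# TCP connect to Google public DNS on port 53 with a 5-second timeout
nc -z -v -w 5 8.8.8.8 53
```
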
### 2. Check Your NFS Service Type and Configuration

Run these commands to examine your current NFS setup:

```bash
# Check NFS-related services
kubectl get svc | grep -i nfs
kubectl get svc | grep -i ganesha

# Get detailed service info
kubectl describe svc <your-nfs-service-name>

# Check endpoints
kubectl get endpoints | grep -i nfs
```

### 3. Common Solutions

#### Option A: Use NodePort Service
NodePort services are accessible from external networks:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-nodeport
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049  # or let K8s assign
  selector:
    app: nfs-ganesha
```

Then test from GCP Batch against the node IP and NodePort (e.g., `<node-ip>:32049`), for example:

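A minimal sketch of that test: a raw TCP check first, then an optional NFSv4 mount. The node IP, the `/export` path, and the assumption that the server speaks NFSv4 on a single port are placeholders/assumptions to adapt to your Ganesha setup:

```bash
# 1. Raw TCP reachability of the NodePort (placeholder node IP)
nc -z -v -w 5 <node-ip> 32049

# 2. Optional: attempt an NFSv4 mount through the NodePort
#    (assumes the server exports /export over NFSv4; adjust path and options)
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs4 -o port=32049,proto=tcp <node-ip>:/export /mnt/nfs-test
```
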
#### Option B: LoadBalancer with Correct Firewall Rules
Ensure your LoadBalancer service has proper firewall rules:

```bash
# Check your LoadBalancer service
kubectl get svc <nfs-service-name> -o yaml

# Create firewall rule if needed
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8 \
  --target-tags gke-<cluster-name>-node
```

#### Option C: Use Cloud Filestore
For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:

```bash
# Create Filestore instance
gcloud filestore instances create galaxy-filestore \
  --tier=STANDARD \
  --file-share=name="galaxy",capacity=1TB \
  --network=name="<your-vpc>" \
  --zone=<your-zone>
```

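Once the instance is ready, the share mounts like any other NFS export. A sketch using the instance and share names from the command above - the `--format` projection is standard gcloud, but verify the field path and mount point against your environment:

```bash
# Look up the Filestore instance's IP address
FILESTORE_IP=$(gcloud filestore instances describe galaxy-filestore \
  --zone=<your-zone> --format="value(networks[0].ipAddresses[0])")

# Mount the "galaxy" file share defined at creation time
sudo mkdir -p /mnt/galaxy
sudo mount -t nfs "${FILESTORE_IP}:/galaxy" /mnt/galaxy
```
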
### 4. Network Debugging Commands

Run these from a machine with kubectl access to understand the network setup:

```bash
# Get node info
kubectl get nodes -o wide

# Check what's running on nodes
kubectl get pods -o wide | grep nfs

# Test from a pod inside the cluster
kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then inside the pod:
nc -zv <nfs-service-ip> 2049
```

### 5. Advanced Debugging with Enhanced Tool

Use the enhanced tool I created to test different scenarios:

1. **Test Galaxy Web Service**: `test_type=galaxy_web`
   - This will try to find and test your Galaxy web service
   - If this fails too, it's a broader networking issue

2. **Test Custom Endpoints**: `test_type=custom`
   - Test specific IPs you know should work
   - Try testing a GKE node IP directly

3. **Check Kubernetes DNS**: `test_type=k8s_dns`
   - This tests if Batch workers can reach Kubernetes cluster services (a sketch of this comparison follows the list)

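For item 3, the expected outcome is asymmetric: the cluster-internal name should fail to resolve from a Batch worker but succeed from inside the cluster. A hedged sketch of that comparison (the busybox image tag is just an example):

```bash
# From a Batch worker or any external VM: expected to fail, since this name
# is only served by the cluster's internal DNS
nslookup kubernetes.default.svc.cluster.local

# From a throwaway pod inside the cluster: expected to succeed
kubectl run dns-check --image=busybox:1.36 -it --rm --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```
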
## 🛠️ Enhanced Container Tools

The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:

### Core Network Tools
- `ip` - Advanced IP routing and network device configuration
- `ping` - Basic connectivity testing
- `nslookup`/`dig` - DNS resolution testing
- `curl`/`wget` - HTTP/HTTPS testing
- `telnet` - Port connectivity testing
- `traceroute` - Network path tracing
- `netstat` - Network connection status
- `ss` - Socket statistics
- `tcpdump` - Network packet capture
- `nmap` - Network scanning and port discovery

### Enhanced Test Script

With these tools, the container can now provide much more detailed debugging information:

```bash
# Network interface details
ip addr show
ip route show

# DNS resolution testing
nslookup target-host
dig target-host

# Port scanning
nmap -p 2049 target-host

# HTTP/HTTPS testing (for web services)
curl -v http://target-host:port

# Network path tracing
traceroute target-host
```

## Root Cause Analysis

Based on your description, the most likely issues are:

1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
3. **Network policies** in your cluster might be blocking external traffic
4. **GKE cluster** might be using a different subnet than GCP Batch workers (the checks below cover points 3 and 4)

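Points 3 and 4 can be checked directly with standard kubectl/gcloud queries; the cluster name and region are placeholders:

```bash
# 3. Any NetworkPolicy objects that might restrict traffic?
kubectl get networkpolicy --all-namespaces

# 4. Which VPC network and subnetwork does the GKE cluster use?
#    Compare these with the network configured for your GCP Batch jobs.
gcloud container clusters describe <cluster-name> \
  --region <region> --format="value(network,subnetwork)"
```
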
## Recommended Solution

For Galaxy on GKE with GCP Batch integration, I recommend:

1. **Use Google Cloud Filestore** for shared storage (most reliable)
2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
3. **Test with the enhanced debugging tool** to get detailed network information

Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?
