# GCP Batch - Kubernetes Connectivity Debugging Guide

## Analysis of Your Test Results

Based on your Google DNS test output, here's what we learned:

### ✅ What's Working
- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
- **Basic networking is operational**: The Batch worker has internet access
- **DNS resolution works**: The container can resolve external addresses

### ❌ What's Not Working
- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
- **Container tooling limited**: the `ip` command is not available in the container

### 🔍 Key Insight
This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.

## Immediate Action Required

Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:

### 🚀 Quick Fix: NodePort Service (Recommended for Testing)

This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:

```bash
# First, find your current NFS service
kubectl get svc | grep -i nfs

# Create a NodePort service (replace with your actual NFS service details)
kubectl create service nodeport nfs-ganesha-external \
  --tcp=2049:2049 \
  --node-port=32049

# Or apply this YAML:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-external
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF
```

Then test with your tool using a GKE node IP and port 32049:

```bash
# Get a node IP
kubectl get nodes -o wide

# Test connectivity to <node-ip>:32049
```

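If you want a quick manual check before re-running the debugging tool, a probe of the NodePort from any VM in the same VPC might look like this (the IP below is a placeholder for a real node internal IP from the command above):

```bash
# Probe the NFS NodePort from a VM in the same VPC
# (replace 10.128.0.5 with an actual node INTERNAL-IP)
nc -zv -w 5 10.128.0.5 32049
```
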
### 🎯 Production Fix: LoadBalancer with Firewall Rules

For production, use a LoadBalancer service with proper firewall configuration:

```bash
# Create LoadBalancer service
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-lb
spec:
  type: LoadBalancer
  ports:
  - port: 2049
    targetPort: 2049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF

# Wait for external IP assignment
kubectl get svc nfs-ganesha-lb -w

# Create firewall rule allowing GCP Batch to access NFS
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
  --description "Allow NFS access from GCP Batch workers"
```

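Once the external IP is assigned, a quick sanity check along these lines confirms the port is reachable from outside the cluster (a sketch; adjust the service name if yours differs):

```bash
# Grab the LoadBalancer's external IP once it has been provisioned
NFS_LB_IP=$(kubectl get svc nfs-ganesha-lb \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Verify the NFS port is reachable
nc -zv -w 5 "$NFS_LB_IP" 2049
```
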
### 📋 Next Steps

1. **Implement NodePort solution** for immediate testing
2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
3. **If NodePort works**, move to LoadBalancer for production use
4. **Update Galaxy configuration** to use the new NFS endpoint

### 💡 Why This Happens

Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.

## The Core Problem

You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:

1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic (see the check sketched below)

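If you suspect the firewall-rule scenario, listing the project's rules and looking for anything that covers the NFS port is a quick way to confirm (the rule name in the `describe` call is just an example):

```bash
# Look for firewall rules that mention the NFS port or an NFS-related name
gcloud compute firewall-rules list | grep -i -E "2049|nfs"

# Inspect a specific rule in detail (replace with the actual rule name)
gcloud compute firewall-rules describe allow-nfs-from-batch
```
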
## 🔍 Quick Diagnostic Commands

Run these commands to understand your current setup before making changes:

```bash
# 1. Find your current NFS-related services and pods
kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"

# 2. Check what's actually running
kubectl get pods -o wide | grep -i nfs

# 3. Look at your current service configuration
kubectl get svc -o wide | grep -i nfs

# 4. Check if you have any existing LoadBalancer services
kubectl get svc --field-selector spec.type=LoadBalancer

# 5. Get node names and internal/external IPs for potential NodePort testing
kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $6 "\t" $7}'
```

## 🎯 Your Specific Issue Summary

Based on your test output:
- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
- ❌ **Cannot reach Kubernetes services** (`kubernetes.default.svc.cluster.local:443` failed)
- 📍 **Root cause**: the NFS service is likely ClusterIP type (internal only)
- 🔧 **Solution**: expose NFS externally via NodePort or LoadBalancer

## Debugging Steps

### 1. Test External Connectivity First
Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
```
Test Type: Google DNS - External Test
```
This should succeed and confirms that GCP Batch networking is working.

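For reference, the manual equivalent of this test from any shell with netcat available is roughly the following (a sketch, not necessarily the tool's exact invocation):

```bash
# TCP check against Google's public DNS server, 5-second timeout
nc -zv -w 5 8.8.8.8 53
```
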
### 2. Check Your NFS Service Type and Configuration

Run these commands to examine your current NFS setup:

```bash
# Check NFS-related services
kubectl get svc | grep -i nfs
kubectl get svc | grep -i ganesha

# Get detailed service info
kubectl describe svc <your-nfs-service-name>

# Check endpoints
kubectl get endpoints | grep -i nfs
```

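To pull out just the fields that matter here (service type and cluster IP), a jsonpath query like this works; the service name is a placeholder:

```bash
# Print only the service type and ClusterIP (replace the service name)
kubectl get svc <your-nfs-service-name> \
  -o jsonpath='{.spec.type}{"\t"}{.spec.clusterIP}{"\n"}'
```
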
### 3. Common Solutions

#### Option A: Use NodePort Service
NodePort services are accessible from external networks:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-nodeport
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049  # or let K8s assign
  selector:
    app: nfs-ganesha
```

Then test with the node IP:port (e.g., `<node-ip>:32049`).

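To verify that NFS actually serves data over the NodePort (not just that the port answers), an end-to-end mount test from a VM in the same VPC might look like the sketch below; the node IP and the `/export` path are assumptions, so substitute your node's internal IP and your Ganesha export path:

```bash
# Mount the export exposed via the NodePort (NFSv4 over a non-standard port)
# 10.128.0.5 and /export are placeholders - use a real node IP and export path
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs -o vers=4,port=32049 10.128.0.5:/export /mnt/nfs-test

# Confirm the export is readable, then clean up
ls -l /mnt/nfs-test
sudo umount /mnt/nfs-test
```
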
#### Option B: LoadBalancer with Correct Firewall Rules
Ensure your LoadBalancer service has proper firewall rules:

```bash
# Check your LoadBalancer service
kubectl get svc <nfs-service-name> -o yaml

# Create firewall rule if needed
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8 \
  --target-tags gke-<cluster-name>-node
```

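Note that the `--target-tags` value must match the network tag actually applied to your GKE nodes, which usually includes a generated suffix; one way to look it up (node name and zone are placeholders):

```bash
# Show the network tags on one of the GKE node VMs
gcloud compute instances describe <gke-node-instance-name> \
  --zone <your-zone> \
  --format="value(tags.items)"
```
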
#### Option C: Use Cloud Filestore
For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:

```bash
# Create Filestore instance
gcloud filestore instances create galaxy-filestore \
  --tier=STANDARD \
  --file-share=name="galaxy",capacity=1TB \
  --network=name="<your-vpc>" \
  --zone=<your-zone>
```

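Because Filestore is just NFS served from a VPC-internal IP, both the GKE cluster and the GCP Batch workers can mount it directly. Inside the cluster you would typically point a PersistentVolume at it; the IP, share name, and capacity below are placeholders:

```bash
# PersistentVolume backed by the Filestore instance (values are placeholders)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: galaxy-filestore-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: <filestore-ip>   # from: gcloud filestore instances describe galaxy-filestore
    path: /galaxy            # the file-share name created above
EOF
```
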
### 4. Network Debugging Commands

Run these on a GKE node to understand the network setup:

```bash
# Get node info
kubectl get nodes -o wide

# Check what's running on nodes
kubectl get pods -o wide | grep nfs

# Test from a pod inside the cluster
kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then inside the pod:
nc -zv <nfs-service-ip> 2049
```

### 5. Advanced Debugging with Enhanced Tool

Use the enhanced tool I created to test different scenarios:

1. **Test Galaxy Web Service**: `test_type=galaxy_web`
   - This will try to find and test your Galaxy web service
   - If this fails too, it's a broader networking issue

2. **Test Custom Endpoints**: `test_type=custom`
   - Test specific IPs you know should work
   - Try testing a GKE node IP directly

3. **Check Kubernetes DNS**: `test_type=k8s_dns`
   - This tests if Batch workers can reach Kubernetes cluster services

## 🛠️ Enhanced Container Tools

The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:

### Core Network Tools
- `ip` - Advanced IP routing and network device configuration
- `ping` - Basic connectivity testing
- `nslookup`/`dig` - DNS resolution testing
- `curl`/`wget` - HTTP/HTTPS testing
- `telnet` - Port connectivity testing
- `traceroute` - Network path tracing
- `netstat` - Network connection status
- `ss` - Socket statistics
- `tcpdump` - Network packet capture
- `nmap` - Network scanning and port discovery

### Enhanced Test Script

With these tools, the container can now provide much more detailed debugging information:

```bash
# Network interface details
ip addr show
ip route show

# DNS resolution testing
nslookup target-host
dig target-host

# Port scanning
nmap -p 2049 target-host

# HTTP/HTTPS testing (for web services)
curl -v http://target-host:port

# Network path tracing
traceroute target-host
```

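For reference, if you ever need to rebuild or extend the image, these tools map onto standard packages (assuming a Debian/Ubuntu base); the snippet below is only an illustrative sketch, not the actual Dockerfile of `afgane/gcp-batch-netcat:0.2.0`:

```bash
# Illustrative only - approximate package set for a Debian/Ubuntu-based debug image
apt-get update && apt-get install -y --no-install-recommends \
    iproute2 iputils-ping dnsutils curl wget telnet traceroute \
    net-tools tcpdump nmap netcat-openbsd
```
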
## Root Cause Analysis

Based on your description, the most likely issues are:

1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
3. **Network policies** in your cluster might be blocking external traffic (see the checks sketched below)
4. **GKE cluster** might be using a different subnet than GCP Batch workers

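Points 3 and 4 are easy to rule in or out with a few read-only commands; the cluster name, region, and network below are placeholders:

```bash
# Any NetworkPolicy objects that could restrict ingress to the NFS pods?
kubectl get networkpolicy --all-namespaces

# Which network and subnet does the GKE cluster use?
gcloud container clusters describe <cluster-name> \
  --region <region> \
  --format="value(network,subnetwork)"

# List the subnets in the VPC used by the Batch jobs for comparison
gcloud compute networks subnets list --network <your-vpc>
```
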
## Recommended Solution

For Galaxy on GKE with GCP Batch integration, I recommend:

1. **Use Google Cloud Filestore** for shared storage (most reliable)
2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
3. **Test with the enhanced debugging tool** to get detailed network information

Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?