
# GCP Batch - Kubernetes Connectivity Debugging Guide

## Analysis of Your Test Results

Based on your Google DNS test output, here's what we learned:

### ✅ What's Working
- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
- **Basic networking is operational**: The Batch worker has internet access
- **DNS resolution works**: The container can resolve external addresses

### ❌ What's Not Working
- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
- **Container tooling limited**: `ip` command not available in the container

### 🔍 Key Insight
This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.

## Immediate Action Required

Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:

### 🚀 Quick Fix: NodePort Service (Recommended for Testing)

This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:

```bash
# First, find your current NFS service
kubectl get svc | grep -i nfs

# Create a NodePort service (replace with your actual NFS service details).
# Note: this shortcut sets the selector to app=nfs-ganesha-external, so use the
# YAML below instead if your NFS pods carry different labels.
kubectl create service nodeport nfs-ganesha-external \
  --tcp=2049:2049 \
  --node-port=32049

# Or apply this YAML:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-external
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF
```

Then test with your tool using a GKE node IP and port 32049:

```bash
# Get node IPs (use the INTERNAL-IP column when the Batch workers share the VPC)
kubectl get nodes -o wide

# Test connectivity to <node-ip>:32049
```
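
If you want to verify the endpoint by hand before re-running the tool, a minimal check with `nc` (a sketch, assuming the NodePort service above and a node's internal IP in place of `<node-ip>`) looks like this:

```bash
# Probe the NodePort from any VM in the same VPC as the GKE nodes
# (e.g. a Batch worker or a temporary test instance)
nc -zv -w 5 <node-ip> 32049
```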

### 🎯 Production Fix: LoadBalancer with Firewall Rules

For production, use a LoadBalancer service with proper firewall configuration:

```bash
# Create LoadBalancer service
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-lb
spec:
  type: LoadBalancer
  ports:
  - port: 2049
    targetPort: 2049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF

# Wait for external IP assignment
kubectl get svc nfs-ganesha-lb -w

# Create firewall rule allowing GCP Batch to access NFS
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
  --description "Allow NFS access from GCP Batch workers"
```
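
Once an external IP has been assigned, you can capture it and probe the NFS port directly. This is a sketch that assumes the `nfs-ganesha-lb` service name used above:

```bash
# Capture the assigned LoadBalancer IP and test port 2049
NFS_LB_IP=$(kubectl get svc nfs-ganesha-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "NFS LoadBalancer IP: ${NFS_LB_IP}"
nc -zv -w 5 "${NFS_LB_IP}" 2049
```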

### 📋 Next Steps

1. **Implement NodePort solution** for immediate testing
2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
3. **If NodePort works**, move to LoadBalancer for production use
4. **Update Galaxy configuration** to use the new NFS endpoint

### 💡 Why This Happens

Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.

## The Core Problem

You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:

1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
3. **LoadBalancer services** should work, but firewall rules may be blocking the traffic (a quick check is sketched below)
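
To rule out the firewall quickly, you can list the rules that apply to your VPC. This is a sketch; `<your-vpc>` is a placeholder for your network name:

```bash
# List firewall rules on the VPC shared by GKE and GCP Batch
gcloud compute firewall-rules list --filter="network:<your-vpc>"
```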

## 🔍 Quick Diagnostic Commands

Run these commands to understand your current setup before making changes:

```bash
# 1. Find your current NFS-related services and pods
kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"

# 2. Check what's actually running
kubectl get pods -o wide | grep -i nfs

# 3. Look at your current service configuration
kubectl get svc -o wide | grep -i nfs

# 4. Check if you have any existing LoadBalancer services
kubectl get svc --field-selector spec.type=LoadBalancer

# 5. Get node names with internal and external IPs for potential NodePort testing
kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $6 "\t" $7}'
```

## 🎯 Your Specific Issue Summary

Based on your test output:
- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
- ❌ **Cannot reach Kubernetes services** (kubernetes.default.svc.cluster.local:443 failed)
- 📍 **Root cause**: NFS service is likely ClusterIP type (internal only)
- 🔧 **Solution**: Expose NFS externally via NodePort or LoadBalancer

## Debugging Steps

### 1. Test External Connectivity First
Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
```
Test Type: Google DNS - External Test
```
This should succeed, confirming that basic GCP Batch networking is working.
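
For reference, the manual equivalent of this test (roughly what the tool does, assuming `nc` is available in the container) is:

```bash
# Probe Google DNS on TCP port 53
nc -zv -w 5 8.8.8.8 53
```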

### 2. Check Your NFS Service Type and Configuration

Run these commands to examine your current NFS setup:

```bash
# Check NFS-related services
kubectl get svc | grep -i nfs
kubectl get svc | grep -i ganesha

# Get detailed service info
kubectl describe svc <your-nfs-service-name>

# Check endpoints
kubectl get endpoints | grep -i nfs
```

### 3. Common Solutions

#### Option A: Use NodePort Service
NodePort services are accessible from external networks:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-nodeport
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049  # or let K8s assign
  selector:
    app: nfs-ganesha
```

Then test against a node's IP and the NodePort (e.g., `<node-ip>:32049`).

#### Option B: LoadBalancer with Correct Firewall Rules
Ensure your LoadBalancer service has proper firewall rules:

```bash
# Check your LoadBalancer service
kubectl get svc <nfs-service-name> -o yaml

# Create firewall rule if needed
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8 \
  --target-tags gke-<cluster-name>-node
```
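
Real GKE node tags usually include a generated hash (e.g. `gke-<cluster-name>-<hash>-node`), so it is worth looking up the exact tag before creating the rule. A sketch:

```bash
# Look up the network tags actually applied to your GKE nodes
gcloud compute instances list \
  --filter="name~^gke-" \
  --format="value(name,tags.items)"
```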

#### Option C: Use Cloud Filestore
For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:

```bash
# Create Filestore instance
gcloud filestore instances create galaxy-filestore \
  --tier=STANDARD \
  --file-share=name="galaxy",capacity=1TB \
  --network=name="<your-vpc>" \
  --zone=<your-zone>
```
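
After the instance is created, you will need its IP address to mount the share from GKE and from GCP Batch workers. A sketch, assuming the instance name and zone used above:

```bash
# Get the Filestore instance IP (then mount e.g. <ip>:/galaxy)
gcloud filestore instances describe galaxy-filestore \
  --zone=<your-zone> \
  --format="value(networks[0].ipAddresses[0])"
```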

### 4. Network Debugging Commands

Run these with `kubectl` to understand the network setup:

```bash
# Get node info
kubectl get nodes -o wide

# Check what's running on nodes
kubectl get pods -o wide | grep nfs

# Test from a pod inside the cluster
kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then inside the pod:
nc -zv <nfs-service-ip> 2049
```
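
To approximate what a Batch worker actually sees, you can also test from a throwaway Compute Engine VM in the same subnet. This is a sketch; the instance name, zone, subnet, and target IP are placeholders, and you may need to install `netcat` on the VM first:

```bash
# Create a small test VM in the same VPC/subnet as the Batch workers
gcloud compute instances create batch-connectivity-test \
  --machine-type=e2-micro \
  --zone=<your-zone> \
  --subnet=<your-subnet>

# SSH in and probe the NFS endpoint (NodePort, LoadBalancer IP, or Filestore IP);
# install netcat on the VM first if it is not present
gcloud compute ssh batch-connectivity-test --zone=<your-zone> \
  -- nc -zv -w 5 <target-ip> 2049

# Clean up when done
gcloud compute instances delete batch-connectivity-test --zone=<your-zone> --quiet
```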

### 5. Advanced Debugging with Enhanced Tool

Use the enhanced tool I created to test different scenarios:

1. **Test Galaxy Web Service**: `test_type=galaxy_web`
   - This will try to find and test your Galaxy web service
   - If this fails too, it's a broader networking issue

2. **Test Custom Endpoints**: `test_type=custom`
   - Test specific IPs you know should work
   - Try testing a GKE node IP directly

3. **Check Kubernetes DNS**: `test_type=k8s_dns`
   - This tests if Batch workers can reach Kubernetes cluster services

## 🛠️ Enhanced Container Tools

The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:

### Core Network Tools
- `ip` - Advanced IP routing and network device configuration
- `ping` - Basic connectivity testing
- `nslookup`/`dig` - DNS resolution testing
- `curl`/`wget` - HTTP/HTTPS testing
- `telnet` - Port connectivity testing
- `traceroute` - Network path tracing
- `netstat` - Network connection status
- `ss` - Socket statistics
- `tcpdump` - Network packet capture
- `nmap` - Network scanning and port discovery

### Enhanced Test Script

With these tools, the container can now provide much more detailed debugging information:

```bash
# Network interface details
ip addr show
ip route show

# DNS resolution testing
nslookup target-host
dig target-host

# Port scanning
nmap -p 2049 target-host

# HTTP/HTTPS testing (for web services)
curl -v http://target-host:port

# Network path tracing
traceroute target-host
```

## Root Cause Analysis

Based on your description, the most likely issues are:

1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
3. **Network policies** in your cluster might be blocking external traffic
4. **GKE cluster** might be using a different subnet than the GCP Batch workers (both checks are sketched below)
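
Items 3 and 4 are quick to rule out from the command line. A sketch, with `<cluster-name>` and `<zone>` as placeholders:

```bash
# 3. List NetworkPolicies that could be blocking external traffic
kubectl get networkpolicy --all-namespaces

# 4. Check which network/subnetwork the GKE cluster uses and compare it with
#    the network configured for your GCP Batch jobs
gcloud container clusters describe <cluster-name> \
  --zone=<zone> \
  --format="value(network,subnetwork)"
```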

## Recommended Solution

For Galaxy on GKE with GCP Batch integration, I recommend:

1. **Use Google Cloud Filestore** for shared storage (most reliable)
2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
3. **Test with the enhanced debugging tool** to get detailed network information

Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?