
# GCP Batch - Kubernetes Connectivity Debugging Guide

## Analysis of Your Test Results

Based on your Google DNS test output, here's what we learned:

### ✅ What's Working
- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
- **Basic networking is operational**: The Batch worker has internet access
- **DNS resolution works**: The container can resolve external addresses

### ❌ What's Not Working
- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
- **Container tooling limited**: `ip` command not available in the container

### 🔍 Key Insight
This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.

## Immediate Action Required

Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:

### 🚀 Quick Fix: NodePort Service (Recommended for Testing)

This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:

```bash
# First, find your current NFS service
kubectl get svc | grep -i nfs

# Create a NodePort service (replace with your actual NFS service details).
# Note: this shortcut sets the selector to app=nfs-ganesha-external, so use the
# YAML below instead if your NFS pods carry different labels.
kubectl create service nodeport nfs-ganesha-external \
  --tcp=2049:2049 \
  --node-port=32049

# Or apply this YAML:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-external
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF
```

Then test with your tool using a GKE node IP and port 32049:

```bash
# Get node IPs (use the INTERNAL-IP column when the Batch workers share the VPC)
kubectl get nodes -o wide

# Test connectivity to <node-ip>:32049
```
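
If you want to verify the endpoint by hand before re-running the tool, a minimal check with `nc` (a sketch, assuming the NodePort service above and a node's internal IP in place of `<node-ip>`) looks like this:

```bash
# Probe the NodePort from any VM in the same VPC as the GKE nodes
# (e.g. a Batch worker or a temporary test instance)
nc -zv -w 5 <node-ip> 32049
```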

### 🎯 Production Fix: LoadBalancer with Firewall Rules

For production, use a LoadBalancer service with proper firewall configuration:

```bash
# Create LoadBalancer service
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-lb
spec:
  type: LoadBalancer
  ports:
  - port: 2049
    targetPort: 2049
  selector:
    # Replace with your actual NFS pod labels
    app: nfs-ganesha
EOF

# Wait for external IP assignment
kubectl get svc nfs-ganesha-lb -w

# Create firewall rule allowing GCP Batch to access NFS
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
  --description "Allow NFS access from GCP Batch workers"
```
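
Once an external IP has been assigned, you can capture it and probe the NFS port directly. This is a sketch that assumes the `nfs-ganesha-lb` service name used above:

```bash
# Capture the assigned LoadBalancer IP and test port 2049
NFS_LB_IP=$(kubectl get svc nfs-ganesha-lb -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "NFS LoadBalancer IP: ${NFS_LB_IP}"
nc -zv -w 5 "${NFS_LB_IP}" 2049
```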

### 📋 Next Steps

1. **Implement NodePort solution** for immediate testing
2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
3. **If NodePort works**, move to LoadBalancer for production use
4. **Update Galaxy configuration** to use the new NFS endpoint

### 💡 Why This Happens

Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.

## The Core Problem

You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:

1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
3. **LoadBalancer services** should work, but firewall rules may be blocking the traffic (a quick check is sketched below)
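
To rule out the firewall quickly, you can list the rules that apply to your VPC. This is a sketch; `<your-vpc>` is a placeholder for your network name:

```bash
# List firewall rules on the VPC shared by GKE and GCP Batch
gcloud compute firewall-rules list --filter="network:<your-vpc>"
```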

## 🔍 Quick Diagnostic Commands

Run these commands to understand your current setup before making changes:

```bash
# 1. Find your current NFS-related services and pods
kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"

# 2. Check what's actually running
kubectl get pods -o wide | grep -i nfs

# 3. Look at your current service configuration
kubectl get svc -o wide | grep -i nfs

# 4. Check if you have any existing LoadBalancer services
kubectl get svc --field-selector spec.type=LoadBalancer

# 5. Get node names with internal and external IPs for potential NodePort testing
kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $6 "\t" $7}'
```

## 🎯 Your Specific Issue Summary

Based on your test output:
- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
- ❌ **Cannot reach Kubernetes services** (kubernetes.default.svc.cluster.local:443 failed)
- 📍 **Root cause**: NFS service is likely ClusterIP type (internal only)
- 🔧 **Solution**: Expose NFS externally via NodePort or LoadBalancer

## Debugging Steps

### 1. Test External Connectivity First
Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
```
Test Type: Google DNS - External Test
```
This should succeed, confirming that basic GCP Batch networking is working.
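
For reference, the manual equivalent of this test (roughly what the tool does, assuming `nc` is available in the container) is:

```bash
# Probe Google DNS on TCP port 53
nc -zv -w 5 8.8.8.8 53
```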

### 2. Check Your NFS Service Type and Configuration

Run these commands to examine your current NFS setup:

```bash
# Check NFS-related services
kubectl get svc | grep -i nfs
kubectl get svc | grep -i ganesha

# Get detailed service info
kubectl describe svc <your-nfs-service-name>

# Check endpoints
kubectl get endpoints | grep -i nfs
```

### 3. Common Solutions

#### Option A: Use NodePort Service
NodePort services are accessible from external networks:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-nodeport
spec:
  type: NodePort
  ports:
  - port: 2049
    targetPort: 2049
    nodePort: 32049  # or let K8s assign
  selector:
    app: nfs-ganesha
```

Then test against a node's IP and the NodePort (e.g., `<node-ip>:32049`).

#### Option B: LoadBalancer with Correct Firewall Rules
Ensure your LoadBalancer service has proper firewall rules:

```bash
# Check your LoadBalancer service
kubectl get svc <nfs-service-name> -o yaml

# Create firewall rule if needed
gcloud compute firewall-rules create allow-nfs-from-batch \
  --allow tcp:2049 \
  --source-ranges 10.0.0.0/8 \
  --target-tags gke-<cluster-name>-node
```
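
Real GKE node tags usually include a generated hash (e.g. `gke-<cluster-name>-<hash>-node`), so it is worth looking up the exact tag before creating the rule. A sketch:

```bash
# Look up the network tags actually applied to your GKE nodes
gcloud compute instances list \
  --filter="name~^gke-" \
  --format="value(name,tags.items)"
```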

#### Option C: Use Cloud Filestore
For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:

```bash
# Create Filestore instance
gcloud filestore instances create galaxy-filestore \
  --tier=STANDARD \
  --file-share=name="galaxy",capacity=1TB \
  --network=name="<your-vpc>" \
  --zone=<your-zone>
```
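
After the instance is created, you will need its IP address to mount the share from GKE and from GCP Batch workers. A sketch, assuming the instance name and zone used above:

```bash
# Get the Filestore instance IP (then mount e.g. <ip>:/galaxy)
gcloud filestore instances describe galaxy-filestore \
  --zone=<your-zone> \
  --format="value(networks[0].ipAddresses[0])"
```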

### 4. Network Debugging Commands

Run these with `kubectl` to understand the network setup:

```bash
# Get node info
kubectl get nodes -o wide

# Check what's running on nodes
kubectl get pods -o wide | grep nfs

# Test from a pod inside the cluster
kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
# Then inside the pod:
nc -zv <nfs-service-ip> 2049
```
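
To approximate what a Batch worker actually sees, you can also test from a throwaway Compute Engine VM in the same subnet. This is a sketch; the instance name, zone, subnet, and target IP are placeholders, and you may need to install `netcat` on the VM first:

```bash
# Create a small test VM in the same VPC/subnet as the Batch workers
gcloud compute instances create batch-connectivity-test \
  --machine-type=e2-micro \
  --zone=<your-zone> \
  --subnet=<your-subnet>

# SSH in and probe the NFS endpoint (NodePort, LoadBalancer IP, or Filestore IP);
# install netcat on the VM first if it is not present
gcloud compute ssh batch-connectivity-test --zone=<your-zone> \
  -- nc -zv -w 5 <target-ip> 2049

# Clean up when done
gcloud compute instances delete batch-connectivity-test --zone=<your-zone> --quiet
```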

### 5. Advanced Debugging with Enhanced Tool

Use the enhanced tool I created to test different scenarios:

1. **Test Galaxy Web Service**: `test_type=galaxy_web`
   - This will try to find and test your Galaxy web service
   - If this fails too, it's a broader networking issue

2. **Test Custom Endpoints**: `test_type=custom`
   - Test specific IPs you know should work
   - Try testing a GKE node IP directly

3. **Check Kubernetes DNS**: `test_type=k8s_dns`
   - This tests if Batch workers can reach Kubernetes cluster services

## 🛠️ Enhanced Container Tools

The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:

### Core Network Tools
- `ip` - Advanced IP routing and network device configuration
- `ping` - Basic connectivity testing
- `nslookup`/`dig` - DNS resolution testing
- `curl`/`wget` - HTTP/HTTPS testing
- `telnet` - Port connectivity testing
- `traceroute` - Network path tracing
- `netstat` - Network connection status
- `ss` - Socket statistics
- `tcpdump` - Network packet capture
- `nmap` - Network scanning and port discovery

### Enhanced Test Script

With these tools, the container can now provide much more detailed debugging information:

```bash
# Network interface details
ip addr show
ip route show

# DNS resolution testing
nslookup target-host
dig target-host

# Port scanning
nmap -p 2049 target-host

# HTTP/HTTPS testing (for web services)
curl -v http://target-host:port

# Network path tracing
traceroute target-host
```

## Root Cause Analysis

Based on your description, the most likely issues are:

1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
3. **Network policies** in your cluster might be blocking external traffic
4. **GKE cluster** might be using a different subnet than the GCP Batch workers (both checks are sketched below)
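
Items 3 and 4 are quick to rule out from the command line. A sketch, with `<cluster-name>` and `<zone>` as placeholders:

```bash
# 3. List NetworkPolicies that could be blocking external traffic
kubectl get networkpolicy --all-namespaces

# 4. Check which network/subnetwork the GKE cluster uses and compare it with
#    the network configured for your GCP Batch jobs
gcloud container clusters describe <cluster-name> \
  --zone=<zone> \
  --format="value(network,subnetwork)"
```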

## Recommended Solution

For Galaxy on GKE with GCP Batch integration, I recommend:

1. **Use Google Cloud Filestore** for shared storage (most reliable)
2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
3. **Test with the enhanced debugging tool** to get detailed network information

Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?