changeset 5:b2ce158b4f22 draft

planemo upload commit ece227052d14d755b0d0b07a827152b2e98fb94b
author enis
date Thu, 24 Jul 2025 21:41:18 +0000
parents 2ff4a39ea41b
children d25792770df8
files DEBUGGING_GUIDE.md Dockerfile gcp_batch_netcat.py gcp_batch_netcat.xml
diffstat 4 files changed, 581 insertions(+), 65 deletions(-) [+]
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/DEBUGGING_GUIDE.md	Thu Jul 24 21:41:18 2025 +0000
@@ -0,0 +1,303 @@
+# GCP Batch - Kubernetes Connectivity Debugging Guide
+
+## Analysis of Your Test Results
+
+Based on your Google DNS test output, here's what we learned:
+
+### ✅ What's Working
+- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
+- **Basic networking is operational**: The Batch worker has internet access
+- **DNS resolution works**: The container can resolve external addresses
+
+### ❌ What's Not Working
+- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
+- **Container tooling limited**: `ip` command not available in the container
+
+### 🔍 Key Insight
+This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.
+
+## Immediate Action Required
+
+Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:
+
+### 🚀 Quick Fix: NodePort Service (Recommended for Testing)
+
+This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:
+
+```bash
+# First, find your current NFS service
+kubectl get svc | grep -i nfs
+
+# Create a NodePort service (note: this shortcut sets the selector to
+# app=nfs-ganesha-external from the service name, so prefer the YAML
+# below unless your pods actually carry that label)
+kubectl create service nodeport nfs-ganesha-external \
+  --tcp=2049:2049 \
+  --node-port=32049
+
+# Or apply this YAML:
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Service
+metadata:
+  name: nfs-ganesha-external
+spec:
+  type: NodePort
+  ports:
+  - port: 2049
+    targetPort: 2049
+    nodePort: 32049
+  selector:
+    # Replace with your actual NFS pod labels
+    app: nfs-ganesha
+EOF
+```
+
+Then test with your tool using a GKE node IP and port 32049:
+
+```bash
+# Get a node IP
+kubectl get nodes -o wide
+
+# Test connectivity to <node-ip>:32049: in the Galaxy tool set
+# test_type=custom, custom_host=<node-ip>, custom_port=32049,
+# or verify directly from any VM in the same VPC:
+nc -zv <node-ip> 32049
+```
+
+### 🎯 Production Fix: LoadBalancer with Firewall Rules
+
+For production, use a LoadBalancer service with proper firewall configuration:
+
+```bash
+# Create LoadBalancer service
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Service
+metadata:
+  name: nfs-ganesha-lb
+spec:
+  type: LoadBalancer
+  ports:
+  - port: 2049
+    targetPort: 2049
+  selector:
+    # Replace with your actual NFS pod labels
+    app: nfs-ganesha
+EOF
+
+# Wait for external IP assignment
+kubectl get svc nfs-ganesha-lb -w
+
+# Create firewall rule allowing GCP Batch to access NFS
+gcloud compute firewall-rules create allow-nfs-from-batch \
+  --allow tcp:2049 \
+  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
+  --description "Allow NFS access from GCP Batch workers"
+```
+
+### 📋 Next Steps
+
+1. **Implement NodePort solution** for immediate testing
+2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
+3. **If NodePort works**, move to LoadBalancer for production use
+4. **Update Galaxy configuration** to use the new NFS endpoint
+
+### 💡 Why This Happens
+
+Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior: ClusterIP addresses are virtual IPs that are only routable from within the cluster.
+
+## The Core Problem
+
+You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:
+
+1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
+2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
+3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic
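+
+A minimal way to see this asymmetry for yourself, assuming you know your NFS service's ClusterIP (the same netshoot image used later in this guide works well here):
+
+```bash
+# From a pod inside the cluster (expected to succeed):
+kubectl run tmp-debug --image=nicolaka/netshoot -it --rm -- nc -zv <nfs-cluster-ip> 2049
+
+# From a Compute Engine VM in the same VPC, outside the cluster (expected to fail):
+nc -zv <nfs-cluster-ip> 2049
+```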
+
+## 🔍 Quick Diagnostic Commands
+
+Run these commands to understand your current setup before making changes:
+
+```bash
+# 1. Find your current NFS-related services and pods
+kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"
+
+# 2. Check what's actually running
+kubectl get pods -o wide | grep -i nfs
+
+# 3. Look at your current service configuration
+kubectl get svc -o wide | grep -i nfs
+
+# 4. Check if you have any existing LoadBalancer services
+kubectl get svc --field-selector spec.type=LoadBalancer
+
+# 5. Get node names plus internal/external IPs for potential NodePort testing
+#    (in -o wide output, INTERNAL-IP is column 6 and EXTERNAL-IP is column 7)
+kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $6 "\t" $7}'
+```
+
+## 🎯 Your Specific Issue Summary
+
+Based on your test output:
+- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
+- ❌ **Cannot reach Kubernetes services** (kubernetes.default.svc.cluster.local:443 failed)
+- 📍 **Root cause**: NFS service is likely ClusterIP type (internal only)
+- 🔧 **Solution**: Expose NFS externally via NodePort or LoadBalancer
+
+## Debugging Steps
+
+### 1. Test External Connectivity First
+Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
+```
+Test Type: Google DNS - External Test
+```
+This should succeed and confirms GCP Batch networking is working.
+
+### 2. Check Your NFS Service Type and Configuration
+
+Run these commands to examine your current NFS setup:
+
+```bash
+# Check NFS-related services
+kubectl get svc | grep -i nfs
+kubectl get svc | grep -i ganesha
+
+# Get detailed service info
+kubectl describe svc <your-nfs-service-name>
+
+# Check endpoints
+kubectl get endpoints | grep -i nfs
+```
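+
+For a one-line summary of the key facts (a jsonpath sketch; replace the service name with yours):
+
+```bash
+kubectl get svc <your-nfs-service-name> -o jsonpath='{.spec.type} {.spec.clusterIP}{"\n"}'
+```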
+
+### 3. Common Solutions
+
+#### Option A: Use NodePort Service
+NodePort services are accessible from external networks:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: nfs-ganesha-nodeport
+spec:
+  type: NodePort
+  ports:
+  - port: 2049
+    targetPort: 2049
+    nodePort: 32049  # or let K8s assign
+  selector:
+    app: nfs-ganesha
+```
+
+Then test with the node IP:port (e.g., `<node-ip>:32049`)
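+
+On a custom VPC without a broad allow-internal rule, the NodePort may also need its own firewall rule before Batch VMs can reach it. A sketch, with an illustrative rule name:
+
+```bash
+gcloud compute firewall-rules create allow-nfs-nodeport \
+  --allow tcp:32049 \
+  --source-ranges 10.0.0.0/8 \
+  --description "Allow NodePort NFS access from GCP Batch workers"
+```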
+
+#### Option B: LoadBalancer with Correct Firewall Rules
+Ensure your LoadBalancer service has proper firewall rules:
+
+```bash
+# Check your LoadBalancer service
+kubectl get svc <nfs-service-name> -o yaml
+
+# Create firewall rule if needed
+gcloud compute firewall-rules create allow-nfs-from-batch \
+  --allow tcp:2049 \
+  --source-ranges 10.0.0.0/8 \
+  --target-tags gke-<cluster-name>-node
+```
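+
+Once the service has an external IP assigned, it can be read back directly (jsonpath sketch; the service name is a placeholder) and tested with `nc -zv <lb-ip> 2049`:
+
+```bash
+kubectl get svc <nfs-service-name> -o jsonpath='{.status.loadBalancer.ingress[0].ip}{"\n"}'
+```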
+
+#### Option C: Use Cloud Filestore
+For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:
+
+```bash
+# Create Filestore instance
+gcloud filestore instances create galaxy-filestore \
+  --tier=STANDARD \
+  --file-share=name="galaxy",capacity=1TB \
+  --network=name="<your-vpc>" \
+  --zone=<your-zone>
+```
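+
+Once the instance is ready, its IP can be read back with `describe` (format path per the Filestore API; treat this as a sketch) and then tested via the tool's `custom` test type:
+
+```bash
+gcloud filestore instances describe galaxy-filestore \
+  --zone=<your-zone> \
+  --format='value(networks[0].ipAddresses[0])'
+
+# Then test <filestore-ip>:2049 with test_type=custom
+```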
+
+### 4. Network Debugging Commands
+
+Run these on a GKE node to understand the network setup:
+
+```bash
+# Get node info
+kubectl get nodes -o wide
+
+# Check what's running on nodes
+kubectl get pods -o wide | grep nfs
+
+# Test from a pod inside the cluster
+kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
+# Then inside the pod:
+nc -zv <nfs-service-ip> 2049
+```
+
+### 5. Advanced Debugging with Enhanced Tool
+
+Use the enhanced tool I created to test different scenarios:
+
+1. **Test Galaxy Web Service**: `test_type=galaxy_web`
+   - This will try to find and test your Galaxy web service
+   - If this fails too, it's a broader networking issue
+
+2. **Test Custom Endpoints**: `test_type=custom`
+   - Test specific IPs you know should work
+   - Try testing a GKE node IP directly
+
+3. **Check Kubernetes DNS**: `test_type=k8s_dns`
+   - This tests if Batch workers can reach Kubernetes cluster services
+
+## 🛠️ Enhanced Container Tools
+
+The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:
+
+### Core Network Tools
+- `ip` - Advanced IP routing and network device configuration
+- `ping` - Basic connectivity testing
+- `nslookup`/`dig` - DNS resolution testing
+- `curl`/`wget` - HTTP/HTTPS testing
+- `telnet` - Port connectivity testing
+- `traceroute` - Network path tracing
+- `netstat` - Network connection status
+- `ss` - Socket statistics
+- `tcpdump` - Network packet capture
+- `nmap` - Network scanning and port discovery
+
+### Enhanced Test Script
+
+With these tools, the container can now provide much more detailed debugging information:
+
+```bash
+# Network interface details
+ip addr show
+ip route show
+
+# DNS resolution testing
+nslookup target-host
+dig target-host
+
+# Port scanning
+nmap -p 2049 target-host
+
+# HTTP/HTTPS testing (for web services)
+curl -v http://target-host:port
+
+# Network path tracing
+traceroute target-host
+```
+
+## Root Cause Analysis
+
+Based on your description, the most likely issues are:
+
+1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
+2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
+3. **Network policies** in your cluster might be blocking external traffic (see the quick check after this list)
+4. **GKE cluster** might be using a different subnet than GCP Batch workers
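+
+For item 3, a quick way to rule network policies in or out is to list them; an empty result means they are not the cause:
+
+```bash
+kubectl get networkpolicy --all-namespaces
+```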
+
+## Recommended Solution
+
+For Galaxy on GKE with GCP Batch integration, I recommend:
+
+1. **Use Google Cloud Filestore** for shared storage (most reliable)
+2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
+3. **Test with the enhanced debugging tool** to get detailed network information
+
+Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?
--- a/Dockerfile	Tue Jul 22 14:47:47 2025 +0000
+++ b/Dockerfile	Thu Jul 24 21:41:18 2025 +0000
@@ -1,6 +1,22 @@
 FROM google/cloud-sdk:latest
 
-RUN apt-get update && apt-get install -y python3 python3-pip netcat-openbsd
+# Install essential networking and debugging tools
+RUN apt-get update && apt-get install -y \
+    python3 \
+    python3-pip \
+    netcat-openbsd \
+    iproute2 \
+    iputils-ping \
+    dnsutils \
+    curl \
+    wget \
+    telnet \
+    traceroute \
+    net-tools \
+    tcpdump \
+    nmap \
+    vim \
+    && rm -rf /var/lib/apt/lists/*
 
 RUN pip3 install --break-system-packages google-cloud-batch
 
--- a/gcp_batch_netcat.py	Tue Jul 22 14:47:47 2025 +0000
+++ b/gcp_batch_netcat.py	Thu Jul 24 21:41:18 2025 +0000
@@ -3,12 +3,10 @@
 import logging
 import os
 import sys
-# import time
 import uuid
 from google.cloud import batch_v1
 
 # Configure logging to go to stdout instead of stderr to avoid Galaxy marking job as failed
-import sys
 logging.basicConfig(
     level=logging.INFO,
     format='%(asctime)s - %(levelname)s - %(message)s',
@@ -16,6 +14,114 @@
 )
 logger = logging.getLogger(__name__)
 
+def determine_test_target(args):
+    """Determine the target host and port based on test type"""
+
+    if args.test_type == 'custom':
+        if not args.custom_host:
+            raise ValueError("custom_host is required when test_type is 'custom'")
+        return args.custom_host, args.custom_port
+
+    elif args.test_type == 'nfs':
+        # Extract NFS server address if not provided
+        if args.nfs_address:
+            nfs_address = args.nfs_address
+            logger.info(f"Using provided NFS address: {nfs_address}")
+        else:
+            try:
+                # Try to detect NFS server from /galaxy/server/database/ mount
+                import subprocess
+                result = subprocess.run(['mount'], capture_output=True, text=True)
+                nfs_address = None
+
+                for line in result.stdout.split('\n'):
+                    if '/galaxy/server/database' in line and ':' in line:
+                        # Look for NFS mount pattern: server:/path on /galaxy/server/database
+                        parts = line.split()
+                        for part in parts:
+                            if ':' in part and part.count(':') == 1:
+                                nfs_address = part.split(':')[0]
+                                break
+                        if nfs_address:
+                            logger.info(f"Detected NFS address from mount: {nfs_address}")
+                            break
+
+                if not nfs_address:
+                    # Fallback: try to parse /proc/mounts
+                    try:
+                        with open('/proc/mounts', 'r') as f:
+                            for line in f:
+                                if '/galaxy/server/database' in line and ':' in line:
+                                    parts = line.split()
+                                    if len(parts) > 0 and ':' in parts[0]:
+                                        nfs_address = parts[0].split(':')[0]
+                                        logger.info(f"Detected NFS address from /proc/mounts: {nfs_address}")
+                                        break
+                    except:
+                    except Exception:
+
+                if not nfs_address:
+                    raise ValueError("Could not auto-detect NFS server address from /galaxy/server/database/ mount")
+
+                logger.info(f"Auto-detected NFS address from mount: {nfs_address}")
+            except Exception as e:
+                logger.error(f"Failed to auto-detect NFS address: {e}")
+                raise
+        return nfs_address, 2049
+
+    elif args.test_type == 'galaxy_web':
+        # Try to detect Galaxy web service
+        try:
+            import subprocess
+            result = subprocess.run(['kubectl', 'get', 'svc', '-o', 'json'], capture_output=True, text=True)
+            if result.returncode == 0:
+                services = json.loads(result.stdout)
+                for item in services.get('items', []):
+                    name = item.get('metadata', {}).get('name', '')
+                    if 'galaxy' in name.lower() and ('web' in name.lower() or 'nginx' in name.lower()):
+                        # Found a Galaxy web service
+                        spec = item.get('spec', {})
+                        if spec.get('type') == 'LoadBalancer':
+                            ingress = item.get('status', {}).get('loadBalancer', {}).get('ingress', [])
+                            if ingress:
+                                ip = ingress[0].get('ip')
+                                if ip:
+                                    port = 80
+                                    for port_spec in spec.get('ports', []):
+                                        if port_spec.get('port'):
+                                            port = port_spec['port']
+                                            break
+                                    logger.info(f"Found Galaxy web service LoadBalancer: {ip}:{port}")
+                                    return ip, port
+                        # Fallback to ClusterIP
+                        cluster_ip = spec.get('clusterIP')
+                        if cluster_ip and cluster_ip != 'None':
+                            port = 80
+                            for port_spec in spec.get('ports', []):
+                                if port_spec.get('port'):
+                                    port = port_spec['port']
+                                    break
+                            logger.info(f"Found Galaxy web service ClusterIP: {cluster_ip}:{port}")
+                            return cluster_ip, port
+        except Exception as e:
+            logger.warning(f"Could not auto-detect Galaxy web service: {e}")
+
+        # Fallback: try common Galaxy service names
+        common_hosts = ['galaxy-web', 'galaxy-nginx', 'galaxy']
+        logger.info(f"Trying common Galaxy service name: {common_hosts[0]}")
+        return common_hosts[0], 80
+
+    elif args.test_type == 'k8s_dns':
+        # Test Kubernetes DNS resolution
+        return 'kubernetes.default.svc.cluster.local', 443
+
+    elif args.test_type == 'google_dns':
+        # Test external connectivity
+        return '8.8.8.8', 53
+
+    else:
+        raise ValueError(f"Unsupported test type: {args.test_type}")
+
 def main():
     parser = argparse.ArgumentParser()
     parser.add_argument('--nfs_address', required=False, help='NFS server address (if not provided, will be auto-detected from /galaxy/server/database/ mount)')
@@ -25,6 +131,10 @@
     parser.add_argument('--network', default='default', help='GCP Network name')
     parser.add_argument('--subnet', default='default', help='GCP Subnet name')
     parser.add_argument('--service_account_key', required=True)
+    parser.add_argument('--test_type', default='nfs', choices=['nfs', 'galaxy_web', 'k8s_dns', 'google_dns', 'custom'],
+                       help='Type of connectivity test to perform')
+    parser.add_argument('--custom_host', required=False, help='Custom host to test (required if test_type is custom)')
+    parser.add_argument('--custom_port', type=int, default=80, help='Custom port to test (default: 80)')
     args = parser.parse_args()
 
     # Set up authentication using the service account key
@@ -47,52 +157,13 @@
             logger.error(f"Failed to extract project ID from service account key: {e}")
             raise
 
-    # Extract NFS server address if not provided
-    if args.nfs_address:
-        nfs_address = args.nfs_address
-        logger.info(f"Using provided NFS address: {nfs_address}")
-    else:
-        try:
-            # Try to detect NFS server from /galaxy/server/database/ mount
-            import subprocess
-            result = subprocess.run(['mount'], capture_output=True, text=True)
-            nfs_address = None
-
-            for line in result.stdout.split('\n'):
-                if '/galaxy/server/database' in line and ':' in line:
-                    # Look for NFS mount pattern: server:/path on /galaxy/server/database
-                    parts = line.split()
-                    for part in parts:
-                        if ':' in part and part.count(':') == 1:
-                            nfs_address = part.split(':')[0]
-                            break
-                    if nfs_address:
-                        logger.info(f"Detected NFS address from mount: {nfs_address}")
-                        break
-
-            if not nfs_address:
-                # Fallback: try to parse /proc/mounts
-                try:
-                    with open('/proc/mounts', 'r') as f:
-                        for line in f:
-                            if '/galaxy/server/database' in line and ':' in line:
-                                parts = line.split()
-                                if len(parts) > 0 and ':' in parts[0]:
-                                    nfs_address = parts[0].split(':')[0]
-                                    logger.info(f"Detected NFS address from /proc/mounts: {nfs_address}")
-                                    break
-                except:
-                    pass
-
-            if not nfs_address:
-                raise ValueError("Could not auto-detect NFS server address from /galaxy/server/database/ mount")
-
-            logger.info(f"Auto-detected NFS address from mount: {nfs_address}")
-        except Exception as e:
-            logger.error(f"Failed to auto-detect NFS address: {e}")
-            raise
-
-    # time.sleep(10000)
+    # Determine target host and port based on test type
+    try:
+        target_host, target_port = determine_test_target(args)
+        logger.info(f"Target determined: {target_host}:{target_port}")
+    except Exception as e:
+        logger.error(f"Failed to determine target: {e}")
+        raise
 
     job_name = f'netcat-job-{uuid.uuid4()}'
     logger.info(f"Generated job name: {job_name}")
@@ -107,9 +178,84 @@
     runnable = batch_v1.Runnable()
     runnable.container = batch_v1.Runnable.Container()
     runnable.container.image_uri = "afgane/gcp-batch-netcat:0.2.0"
-    runnable.container.entrypoint = "/usr/bin/nc"
-    runnable.container.commands = ["-z", "-v", nfs_address, "2049"]
-    logger.debug(f"Container config: image={runnable.container.image_uri}, entrypoint={runnable.container.entrypoint}, commands={runnable.container.commands}")
+
+    # Create a comprehensive test script
+    test_script = f'''#!/bin/bash
+set -e
+echo "=== GCP Batch Connectivity Test ==="
+echo "Test Type: {args.test_type}"
+echo "Target: {target_host}:{target_port}"
+echo "Timestamp: $(date)"
+echo "Container hostname: $(hostname)"
+echo ""
+
+# Basic network info
+echo "=== Network Information ==="
+echo "Container IP addresses:"
+hostname -I
+echo "Default route:"
+ip route | grep default || echo "No default route found"
+echo ""
+
+# DNS configuration
+echo "=== DNS Configuration ==="
+echo "DNS servers:"
+cat /etc/resolv.conf | grep nameserver || echo "No nameservers found"
+echo ""
+
+# Test DNS resolution of target
+echo "=== DNS Resolution Test ==="
+echo "Resolving {target_host}:"
+nslookup {target_host} || {{
+    echo "DNS resolution failed for {target_host}"
+    echo "Trying with Google DNS (8.8.8.8):"
+    nslookup {target_host} 8.8.8.8 || echo "DNS resolution failed even with Google DNS"
+}}
+echo ""
+
+# Basic connectivity test
+echo "=== Primary Connectivity Test ==="
+echo "Testing connection to {target_host}:{target_port}..."
+# Temporarily disable exit-on-error: with set -e a failed nc would
+# abort the script before the result could be reported
+set +e
+timeout 30 nc -z -v -w 10 {target_host} {target_port}
+nc_result=$?
+set -e
+echo "Netcat result: $nc_result"
+echo ""
+
+# Additional connectivity tests
+echo "=== Additional Connectivity Tests ==="
+echo "Testing Google DNS (8.8.8.8:53):"
+timeout 10 nc -z -v -w 5 8.8.8.8 53 && echo "✓ External DNS reachable" || echo "✗ External DNS unreachable"
+
+echo "Testing Kubernetes API (if accessible):"
+timeout 10 nc -z -v -w 5 kubernetes.default.svc.cluster.local 443 2>/dev/null && echo "✓ Kubernetes API reachable" || echo "✗ Kubernetes API unreachable"
+
+echo ""
+echo "=== Network Troubleshooting ==="
+echo "Route table:"
+ip route
+echo ""
+echo "ARP table:"
+arp -a 2>/dev/null || echo "ARP command not available"
+echo ""
+
+echo "=== Final Result ==="
+if [ $nc_result -eq 0 ]; then
+    echo "✓ SUCCESS: Connection to {target_host}:{target_port} successful"
+    exit 0
+else
+    echo "✗ FAILED: Connection to {target_host}:{target_port} failed"
+    echo "This suggests a network connectivity issue between GCP Batch and the target service."
+    echo "Common causes:"
+    echo "- Firewall rules blocking traffic"
+    echo "- Service not accessible from external networks"
+    echo "- Target service only accepting internal cluster traffic"
+    exit 1
+fi
+'''
+
+    runnable.container.entrypoint = "/bin/bash"
+    runnable.container.commands = ["-c", test_script]
+    logger.debug(f"Container config: image={runnable.container.image_uri}, entrypoint={runnable.container.entrypoint}")
 
     task = batch_v1.TaskSpec()
     task.runnables = [runnable]
@@ -154,7 +300,7 @@
     logger.info(f"Submitting job with name: {job_name}")
     logger.info(f"Target project: {project_id}")
     logger.info(f"Target Batch region: {args.region}")
-    logger.info(f"NFS target: {nfs_address}:2049")
+    logger.info(f"Test target: {target_host}:{target_port}")
 
     # Proceed with job submission
     try:
@@ -171,7 +317,10 @@
             f.write(f"Job UID: {job_response.uid}\n")
             f.write(f"Project: {project_id}\n")
             f.write(f"Region: {args.region}\n")
-            f.write(f"NFS Address: {nfs_address}:2049\n")
+            f.write(f"Test Type: {args.test_type}\n")
+            f.write(f"Target: {target_host}:{target_port}\n")
+            f.write(f"\nTo view job logs, run:\n")
+            f.write(f"gcloud logging read 'labels.job_uid={job_response.uid}' --project={project_id}\n")
 
     except Exception as e:
         logger.error(f"Error submitting job: {type(e).__name__}: {e}")
@@ -185,6 +334,8 @@
             f.write(f"Job name: {job_name}\n")
             f.write(f"Project: {project_id}\n")
             f.write(f"Region: {args.region}\n")
+            f.write(f"Test Type: {args.test_type}\n")
+            f.write(f"Target: {target_host}:{target_port}\n")
             f.write(f"Traceback:\n")
             f.write(traceback.format_exc())
 
--- a/gcp_batch_netcat.xml	Tue Jul 22 14:47:47 2025 +0000
+++ b/gcp_batch_netcat.xml	Thu Jul 24 21:41:18 2025 +0000
@@ -1,20 +1,27 @@
-<tool id="gcp_batch_netcat" name="GCP Batch Netcat" version="0.1.1">
-    <description>Submit a job to GCP Batch and connect to an NFS server.</description>
+<tool id="gcp_batch_netcat" name="GCP Batch Netcat" version="0.2.0">
+    <description>Submit a job to GCP Batch to test network connectivity.</description>
     <requirements>
-        <!-- <requirement type="package" version="529.0.0">google-cloud-sdk</requirement>
-        <requirement type="package" version="0.7.1">netcat</requirement> -->
         <container type="docker">afgane/gcp-batch-netcat:0.2.0</container>
     </requirements>
     <command><![CDATA[
-python3 '$__tool_directory__/gcp_batch_netcat.py' --nfs_address '$nfs_address' --output '$output' --project '$project' --region '$region' --service_account_key '$service_account_key' --network '$network' --subnet '$subnet'
+python3 '$__tool_directory__/gcp_batch_netcat.py'
+--output '$output'
+--project '$project'
+--region '$region'
+--service_account_key '$service_account_key'
+--network '$network'
+--subnet '$subnet'
+--test_type '$test_type'
+#if $nfs_address
+    --nfs_address '$nfs_address'
+#end if
+#if $custom_host
+    --custom_host '$custom_host'
+    --custom_port '$custom_port'
+#end if
     ]]></command>
     <inputs>
-        <param name="region" type="text" label="GCP Batch Region" optional="false"/>
-        <param name="network" type="text" label="GCP Network name" optional="false"/>
-        <param name="subnet" type="text" label="GCP Subnet name" optional="false"/>
-        <param name="nfs_address" type="text" label="NFS Server Address" help="The address of the NFS server to connect to. If not provided, will be auto-detected." />
+        <param name="region" type="text" label="GCP Batch Region" optional="false" help="Region where the Batch job will run (e.g., us-central1)"/>
+        <param name="network" type="text" label="GCP Network name" optional="false" help="VPC network name where Galaxy is deployed"/>
+        <param name="subnet" type="text" label="GCP Subnet name" optional="false" help="Subnet name where Galaxy is deployed"/>
         <param name="service_account_key" type="data" format="json" label="GCP Service Account Key File" help="JSON key file for GCP service account with Batch API permissions"/>
-        <param name="project" type="text" label="GCP Project ID" help="The ID of the GCP project to use. If not provided, will be extracted from the service account key."/>
+        <param name="project" type="text" label="GCP Project ID" help="The ID of the GCP project to use. If not provided, will be extracted from the service account key." optional="true"/>
+        <param name="test_type" type="select" label="Connectivity Test Type" help="Which endpoint the Batch job should probe">
+            <option value="nfs" selected="true">NFS server (port 2049)</option>
+            <option value="galaxy_web">Galaxy web service</option>
+            <option value="k8s_dns">Kubernetes API via cluster DNS</option>
+            <option value="google_dns">Google DNS (external connectivity)</option>
+            <option value="custom">Custom host and port</option>
+        </param>
+        <param name="nfs_address" type="text" label="NFS Server Address" help="The address of the NFS server to connect to. If not provided, it will be auto-detected from Galaxy's database mount (typically the NFS LoadBalancer IP)." optional="true"/>
+        <param name="custom_host" type="text" label="Custom Host" help="Host to test when the test type is 'custom'" optional="true"/>
+        <param name="custom_port" type="integer" value="80" label="Custom Port" help="Port to test when the test type is 'custom'" optional="true"/>
     </inputs>
     <outputs>
         <data name="output" format="txt"/>
@@ -22,6 +29,45 @@
     <help><![CDATA[
 **What it does**
 
-Submits a job to GCP Batch that connects to the specified NFS server using netcat.
+This enhanced tool submits a job to GCP Batch to test network connectivity between Batch workers and a chosen target: your NFS server, the Galaxy web service, the Kubernetes API, external DNS, or a custom host and port. It runs network debugging tests to help identify connectivity issues in Galaxy deployments on Google Kubernetes Engine (GKE).
+
+**Enhanced Debugging Features**
+
+The test job reports:
+
+* **Network Interface Information**: IP addresses and routing tables
+* **DNS Configuration and Testing**: Nameserver configuration and resolution tests against the target
+* **Primary Connectivity Test**: Netcat connection to the selected target host and port
+* **Additional Tests**: External connectivity (Google DNS) and Kubernetes API reachability
+
+The container also ships with nmap, traceroute, ss, and tcpdump for ad-hoc follow-up debugging.
+
+**Troubleshooting Network Issues**
+
+This tool is particularly useful when Galaxy jobs fail to access shared storage. The comprehensive test output helps identify the root cause:
+
+* **Connection timeouts**: Usually indicates firewall rules blocking traffic or services not accessible from external networks
+* **DNS resolution failures**: May indicate DNS configuration issues
+* **External connectivity works but NFS fails**: Suggests NFS service is ClusterIP type (internal only)
+
+**Viewing Detailed Results**
+
+* **Basic results**: Available in the Galaxy output file
+* **Detailed debugging logs**: Check Google Cloud Logging for the full network analysis, using the job UID reported in the Galaxy output file:
+  ```
+  gcloud logging read 'labels.job_uid=<job-uid>' --project=your-project
+  ```
+
+**Common Solutions**
+
+If the test fails to connect to your NFS server:
+
+1. **Use NodePort service** to expose NFS externally for testing
+2. **Configure LoadBalancer** with proper firewall rules for production
+3. **Consider Google Cloud Filestore** for managed NFS storage
+
+The enhanced debugging output will guide you to the specific networking issue and solution.
     ]]></help>
 </tool>