# HG changeset patch
# User enis
# Date 1755263671 0
# Node ID be2f70ae67490567c1dbae4ee615f6f2ccef7f66
# Parent 56543de399541658adb35f9e8a18faaf4d95979a
planemo upload for repository https://github.com/afgane/gcp_batch_netcat commit 2435de746d841f314b70f6257de0a3abaf77ec90
diff -r 56543de39954 -r be2f70ae6749 DEBUGGING_GUIDE.md
--- a/DEBUGGING_GUIDE.md Thu Aug 14 16:48:42 2025 +0000
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,303 +0,0 @@
-# GCP Batch - Kubernetes Connectivity Debugging Guide
-
-## Analysis of Your Test Results
-
-Based on your Google DNS test output, here's what we learned:
-
-### ✅ What's Working
-- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
-- **Basic networking is operational**: The Batch worker has internet access
-- **DNS resolution works**: The container can resolve external addresses
-
-### ❌ What's Not Working
-- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
-- **Container tooling limited**: `ip` command not available in the container
-
-### 🔍 Key Insight
-This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.
-
-## Immediate Action Required
-
-Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:
-
-### 🚀 Quick Fix: NodePort Service (Recommended for Testing)
-
-This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:
-
-```bash
-# First, find your current NFS service
-kubectl get svc | grep -i nfs
-
-# Create a NodePort service (replace with your actual NFS service details)
-kubectl create service nodeport nfs-ganesha-external \
-  --tcp=2049:2049 \
-  --node-port=32049
-
-# Or apply this YAML:
-cat <:32049
-```
-
-### 🎯 Production Fix: LoadBalancer with Firewall Rules
-
-For production, use a LoadBalancer service with proper firewall configuration:
-
-```bash
-# Create LoadBalancer service
-cat <
-
-# Check endpoints
-kubectl get endpoints | grep -i nfs
-```
-
-### 3. Common Solutions
-
-#### Option A: Use NodePort Service
-NodePort services are accessible from external networks:
-
-```yaml
-apiVersion: v1
-kind: Service
-metadata:
-  name: nfs-ganesha-nodeport
-spec:
-  type: NodePort
-  ports:
-  - port: 2049
-    targetPort: 2049
-    nodePort: 32049 # or let K8s assign
-  selector:
-    app: nfs-ganesha
-```
-
-Then test with the node IP:port (e.g., `:32049`)
-
-#### Option B: LoadBalancer with Correct Firewall Rules
-Ensure your LoadBalancer service has proper firewall rules:
-
-```bash
-# Check your LoadBalancer service
-kubectl get svc -o yaml
-
-# Create firewall rule if needed
-gcloud compute firewall-rules create allow-nfs-from-batch \
-  --allow tcp:2049 \
-  --source-ranges 10.0.0.0/8 \
-  --target-tags gke--node
-```
-
-#### Option C: Use Cloud Filestore
-For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:
-
-```bash
-# Create Filestore instance
-gcloud filestore instances create galaxy-filestore \
-  --tier=STANDARD \
-  --file-share=name="galaxy",capacity=1TB \
-  --network=name="" \
-  --zone=
-```
-
-### 4. Network Debugging Commands
-
-Run these on a GKE node to understand the network setup:
-
-```bash
-# Get node info
-kubectl get nodes -o wide
-
-# Check what's running on nodes
-kubectl get pods -o wide | grep nfs
-
-# Test from a pod inside the cluster
-kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
-# Then inside the pod:
-nc -zv 2049
-```
-
-### 5. Advanced Debugging with Enhanced Tool
-
-Use the enhanced tool I created to test different scenarios:
-
-1. **Test Galaxy Web Service**: `test_type=galaxy_web`
-   - This will try to find and test your Galaxy web service
-   - If this fails too, it's a broader networking issue
-
-2. **Test Custom Endpoints**: `test_type=custom`
-   - Test specific IPs you know should work
-   - Try testing a GKE node IP directly
-
-3. **Check Kubernetes DNS**: `test_type=k8s_dns`
-   - This tests if Batch workers can reach Kubernetes cluster services
-
-## 🛠️ Enhanced Container Tools
-
-The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:
-
-### Core Network Tools
-- `ip` - Advanced IP routing and network device configuration
-- `ping` - Basic connectivity testing
-- `nslookup`/`dig` - DNS resolution testing
-- `curl`/`wget` - HTTP/HTTPS testing
-- `telnet` - Port connectivity testing
-- `traceroute` - Network path tracing
-- `netstat` - Network connection status
-- `ss` - Socket statistics
-- `tcpdump` - Network packet capture
-- `nmap` - Network scanning and port discovery
-
-### Enhanced Test Script
-
-With these tools, the container can now provide much more detailed debugging information:
-
-```bash
-# Network interface details
-ip addr show
-ip route show
-
-# DNS resolution testing
-nslookup target-host
-dig target-host
-
-# Port scanning
-nmap -p 2049 target-host
-
-# HTTP/HTTPS testing (for web services)
-curl -v http://target-host:port
-
-# Network path tracing
-traceroute target-host
-```
-
-## Root Cause Analysis
-
-Based on your description, the most likely issues are:
-
-1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
-2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
-3. **Network policies** in your cluster might be blocking external traffic
-4. **GKE cluster** might be using a different subnet than GCP Batch workers
-
-## Recommended Solution
-
-For Galaxy on GKE with GCP Batch integration, I recommend:
-
-1. **Use Google Cloud Filestore** for shared storage (most reliable)
-2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
-3. **Test with the enhanced debugging tool** to get detailed network information
-
-Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?
diff -r 56543de39954 -r be2f70ae6749 README.md
--- a/README.md Thu Aug 14 16:48:42 2025 +0000
+++ b/README.md Fri Aug 15 13:14:31 2025 +0000
@@ -18,8 +18,7 @@
 - Troubleshooting connectivity issues in Galaxy deployments on Kubernetes
 - Debugging firewall rules, NFS export configurations, and CVMFS client setup
 - Comprehensive Network Diagnostics: DNS resolution, routing, and external connectivity
-- Custom VM Integration: Uses galaxy-k8s-boot-v2025-08-12 image with pre-configured CVMFS client
-
+- Custom VM Integration: Uses (e.g., `galaxy-k8s-boot-v2025-08-12`) image with pre-configured CVMFS client and NFS support
 The tool is available in the Main Tool Shed at: https://toolshed.g2.bx.psu.edu/view/enis/gcp_batch_netcat/
@@ -115,14 +114,6 @@
 - Downloaded JSON key file for the service account
 - Access to the custom VM image: e.g., `galaxy-k8s-boot-v2025-08-12`
-### Network Configuration
-- Firewall rule allowing traffic from the Batch subnet to NFS server:
-```
-gcloud compute firewall-rules create allow-nfs-from-batch \
-  --network=NETWORK_NAME \
-  --allow=tcp:2049
-```
-
 ### NFS Server Setup
 - The NFS service must be accessible via LoadBalancer with external IP (typically private within VPC)
 - NFS server should support NFSv4.2 with sec=sys security
diff -r 56543de39954 -r be2f70ae6749 log1.txt
--- a/log1.txt Thu Aug 14 16:48:42 2025 +0000
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,29 +0,0 @@
-2025-08-14 14:20:55.181 WGST
-✓ Found: export directory
-2025-08-14 14:20:55.187 WGST
-total 4
-2025-08-14 14:20:55.187 WGST
-drwxr-xr-x 3 nobody nogroup 0 Aug 12 15:49 .
-2025-08-14 14:20:55.187 WGST
-drwxrwsrwx 14 nobody nogroup 4096 Aug 14 15:12 pvc-aa9a2d4e-2066-40ec-85de-8eb13c8cb9a5
-2025-08-14 14:20:55.188 WGST
-Looking for PVC directories in export...
-2025-08-14 14:21:06.333 WGST
-report agent state: metadata:{parent:"projects/526897014808/locations/us-east4" zone:"us-east4-b" instance:"netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0-6rs8" instance_id:3148160217266536671 creation_time:{seconds:1755184743 nanos:205126122} creator:"projects/526897014808/regions/us-east4/instanceGroupManagers/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0" version:"cloud-batch-agent_20250723.00_p00" os_release:{key:"ID" value:"ubuntu"} os_release:{key:"NAME" value:"Ubuntu"} os_release:{key:"VERSION" value:"24.04.3 LTS (Noble Numbat)"} os_release:{key:"VERSION_CODENAME" value:"noble"} os_release:{key:"VERSION_ID" value:"24.04"} machine_type:"e2-medium"} agent_info:{state:AGENT_RUNNING job_id:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" user_project_num:526897014808 tasks:{task_id:"action/STARTUP/0/0/group0" task_status:{state:SUCCEEDED status_events:{type:"ASSIGNED" description:"task action/STARTUP/0/0/group0 ASSIGNED" event_time:{seconds:1755184743 nanos:486748362} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task action/STARTUP/0/0/group0 RUNNING" event_time:{seconds:1755184743 nanos:486756403} task_state:RUNNING} status_events:{type:"SUCCEEDED" description:"succeeded" event_time:{seconds:1755184743 nanos:953481712} task_state:SUCCEEDED}}} tasks:{task_id:"task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0" task_status:{state:RUNNING status_events:{type:"ASSIGNED" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 ASSIGNED" event_time:{seconds:1755184746 nanos:423966399} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 RUNNING" event_time:{seconds:1755184746 nanos:423969481} task_state:RUNNING}}} report_time:{seconds:1755184866 nanos:332761176} task_group_id:"group0"} agent_timing_info:{boot_time:{seconds:1755184712 nanos:999352912} script_startup_time:{seconds:1755184734 nanos:429352912} agent_startup_time:{seconds:1755184743 nanos:205126122}}
-2025-08-14 14:21:06.416 WGST
-Server response for instance 3148160217266536671: tasks:{task:"action/STARTUP/0/0/group0" status:{state:SUCCEEDED status_events:{type:"ASSIGNED" description:"task action/STARTUP/0/0/group0 ASSIGNED" event_time:{seconds:1755184743 nanos:486748362} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task action/STARTUP/0/0/group0 RUNNING" event_time:{seconds:1755184743 nanos:486756403} task_state:RUNNING} status_events:{type:"SUCCEEDED" description:"succeeded" event_time:{seconds:1755184743 nanos:953481712} task_state:SUCCEEDED}} intended_state:ASSIGNED job_uid:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" task_group_id:"group0" location:"us-east4" job_id:"netcat-job-9b31e9b3-b4ac-4a1c-8eb6-ed0104b17750"} tasks:{task:"task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0" status:{state:RUNNING status_events:{type:"ASSIGNED" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 ASSIGNED" event_time:{seconds:1755184746 nanos:423966399} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 RUNNING" event_time:{seconds:1755184746 nanos:423969481} task_state:RUNNING}} intended_state:ASSIGNED job_uid:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" task_group_id:"group0" location:"us-east4" job_id:"netcat-job-9b31e9b3-b4ac-4a1c-8eb6-ed0104b17750"} use_batch_monitored_resource:true.
-2025-08-14 14:21:15.570 WGST
-/mnt/nfs/export/pvc-aa9a2d4e-2066-40ec-85de-8eb13c8cb9a5
-2025-08-14 14:21:15.570 WGST
-2025-08-14 14:21:15.570 WGST
-=== Looking for Galaxy directories ===
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database/files
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database/objects
-2025-08-14 14:21:15.576 WGST
-✗ Not found: tools
-2025-08-14 14:21:15.576 WGST
-✗ Not found: shed_tools