Mercurial > repos > enis > gcp_batch_netcat
changeset 13:be2f70ae6749 draft default tip
planemo upload for repository https://github.com/afgane/gcp_batch_netcat commit 2435de746d841f314b70f6257de0a3abaf77ec90
author:   enis
date:     Fri, 15 Aug 2025 13:14:31 +0000
parents:  56543de39954
children: (none)
files:    DEBUGGING_GUIDE.md README.md log1.txt
diffstat: 3 files changed, 1 insertions(+), 342 deletions(-)
--- a/DEBUGGING_GUIDE.md	Thu Aug 14 16:48:42 2025 +0000
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,303 +0,0 @@
-# GCP Batch - Kubernetes Connectivity Debugging Guide
-
-## Analysis of Your Test Results
-
-Based on your Google DNS test output, here's what we learned:
-
-### ✅ What's Working
-- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
-- **Basic networking is operational**: The Batch worker has internet access
-- **DNS resolution works**: The container can resolve external addresses
-
-### ❌ What's Not Working
-- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
-- **Container tooling limited**: `ip` command not available in the container
-
-### 🔍 Key Insight
-This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.
-
-## Immediate Action Required
-
-Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:
-
-### 🚀 Quick Fix: NodePort Service (Recommended for Testing)
-
-This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:
-
-```bash
-# First, find your current NFS service
-kubectl get svc | grep -i nfs
-
-# Create a NodePort service (replace with your actual NFS service details)
-kubectl create service nodeport nfs-ganesha-external \
-  --tcp=2049:2049 \
-  --node-port=32049
-
-# Or apply this YAML:
-cat <<EOF | kubectl apply -f -
-apiVersion: v1
-kind: Service
-metadata:
-  name: nfs-ganesha-external
-spec:
-  type: NodePort
-  ports:
-  - port: 2049
-    targetPort: 2049
-    nodePort: 32049
-  selector:
-    # Replace with your actual NFS pod labels
-    app: nfs-ganesha
-EOF
-```
-
-Then test with your tool using a GKE node IP and port 32049:
-
-```bash
-# Get a node IP
-kubectl get nodes -o wide
-
-# Test connectivity to <node-ip>:32049
-```
-
-### 🎯 Production Fix: LoadBalancer with Firewall Rules
-
-For production, use a LoadBalancer service with proper firewall configuration:
-
-```bash
-# Create LoadBalancer service
-cat <<EOF | kubectl apply -f -
-apiVersion: v1
-kind: Service
-metadata:
-  name: nfs-ganesha-lb
-spec:
-  type: LoadBalancer
-  ports:
-  - port: 2049
-    targetPort: 2049
-  selector:
-    # Replace with your actual NFS pod labels
-    app: nfs-ganesha
-EOF
-
-# Wait for external IP assignment
-kubectl get svc nfs-ganesha-lb -w
-
-# Create firewall rule allowing GCP Batch to access NFS
-gcloud compute firewall-rules create allow-nfs-from-batch \
-  --allow tcp:2049 \
-  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
-  --description "Allow NFS access from GCP Batch workers"
-```
-
-### 📋 Next Steps
-
-1. **Implement NodePort solution** for immediate testing
-2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
-3. **If NodePort works**, move to LoadBalancer for production use
-4. **Update Galaxy configuration** to use the new NFS endpoint
-
-### 💡 Why This Happens
-
-Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.
-
-## The Core Problem
-
-You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:
-
-1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
-2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
-3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic
-
-## 🔍 Quick Diagnostic Commands
-
-Run these commands to understand your current setup before making changes:
-
-```bash
-# 1. Find your current NFS-related services and pods
-kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"
-
-# 2. Check what's actually running
-kubectl get pods -o wide | grep -i nfs
-
-# 3. Look at your current service configuration
-kubectl get svc -o wide | grep -i nfs
-
-# 4. Check if you have any existing LoadBalancer services
-kubectl get svc --field-selector spec.type=LoadBalancer
-
-# 5. Get node IPs for potential NodePort testing
-kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $7}'
-```
-
-## 🎯 Your Specific Issue Summary
-
-Based on your test output:
-- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
-- ❌ **Cannot reach Kubernetes services** (kubernetes.default.svc.cluster.local:443 failed)
-- 📍 **Root cause**: NFS service is likely ClusterIP type (internal only)
-- 🔧 **Solution**: Expose NFS externally via NodePort or LoadBalancer
-
-## Debugging Steps
-
-### 1. Test External Connectivity First
-Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
-```
-Test Type: Google DNS - External Test
-```
-This should succeed and confirms GCP Batch networking is working.
-
-### 2. Check Your NFS Service Type and Configuration
-
-Run these commands to examine your current NFS setup:
-
-```bash
-# Check NFS-related services
-kubectl get svc | grep -i nfs
-kubectl get svc | grep -i ganesha
-
-# Get detailed service info
-kubectl describe svc <your-nfs-service-name>
-
-# Check endpoints
-kubectl get endpoints | grep -i nfs
-```
-
-### 3. Common Solutions
-
-#### Option A: Use NodePort Service
-NodePort services are accessible from external networks:
-
-```yaml
-apiVersion: v1
-kind: Service
-metadata:
-  name: nfs-ganesha-nodeport
-spec:
-  type: NodePort
-  ports:
-  - port: 2049
-    targetPort: 2049
-    nodePort: 32049  # or let K8s assign
-  selector:
-    app: nfs-ganesha
-```
-
-Then test with the node IP:port (e.g., `<node-ip>:32049`)
-
-#### Option B: LoadBalancer with Correct Firewall Rules
-Ensure your LoadBalancer service has proper firewall rules:
-
-```bash
-# Check your LoadBalancer service
-kubectl get svc <nfs-service-name> -o yaml
-
-# Create firewall rule if needed
-gcloud compute firewall-rules create allow-nfs-from-batch \
-  --allow tcp:2049 \
-  --source-ranges 10.0.0.0/8 \
-  --target-tags gke-<cluster-name>-node
-```
-
-#### Option C: Use Cloud Filestore
-For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:
-
-```bash
-# Create Filestore instance
-gcloud filestore instances create galaxy-filestore \
-  --tier=STANDARD \
-  --file-share=name="galaxy",capacity=1TB \
-  --network=name="<your-vpc>" \
-  --zone=<your-zone>
-```
-
-### 4. Network Debugging Commands
-
-Run these on a GKE node to understand the network setup:
-
-```bash
-# Get node info
-kubectl get nodes -o wide
-
-# Check what's running on nodes
-kubectl get pods -o wide | grep nfs
-
-# Test from a pod inside the cluster
-kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
-# Then inside the pod:
-nc -zv <nfs-service-ip> 2049
-```
-
-### 5. Advanced Debugging with Enhanced Tool
-
-Use the enhanced tool I created to test different scenarios:
-
-1. **Test Galaxy Web Service**: `test_type=galaxy_web`
-   - This will try to find and test your Galaxy web service
-   - If this fails too, it's a broader networking issue
-
-2. **Test Custom Endpoints**: `test_type=custom`
-   - Test specific IPs you know should work
-   - Try testing a GKE node IP directly
-
-3. **Check Kubernetes DNS**: `test_type=k8s_dns`
-   - This tests if Batch workers can reach Kubernetes cluster services
-
-## 🛠️ Enhanced Container Tools
-
-The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:
-
-### Core Network Tools
-- `ip` - Advanced IP routing and network device configuration
-- `ping` - Basic connectivity testing
-- `nslookup`/`dig` - DNS resolution testing
-- `curl`/`wget` - HTTP/HTTPS testing
-- `telnet` - Port connectivity testing
-- `traceroute` - Network path tracing
-- `netstat` - Network connection status
-- `ss` - Socket statistics
-- `tcpdump` - Network packet capture
-- `nmap` - Network scanning and port discovery
-
-### Enhanced Test Script
-
-With these tools, the container can now provide much more detailed debugging information:
-
-```bash
-# Network interface details
-ip addr show
-ip route show
-
-# DNS resolution testing
-nslookup target-host
-dig target-host
-
-# Port scanning
-nmap -p 2049 target-host
-
-# HTTP/HTTPS testing (for web services)
-curl -v http://target-host:port
-
-# Network path tracing
-traceroute target-host
-```
-
-## Root Cause Analysis
-
-Based on your description, the most likely issues are:
-
-1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
-2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
-3. **Network policies** in your cluster might be blocking external traffic
-4. **GKE cluster** might be using a different subnet than GCP Batch workers
-
-## Recommended Solution
-
-For Galaxy on GKE with GCP Batch integration, I recommend:
-
-1. **Use Google Cloud Filestore** for shared storage (most reliable)
-2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
-3. **Test with the enhanced debugging tool** to get detailed network information
-
-Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?
--- a/README.md	Thu Aug 14 16:48:42 2025 +0000
+++ b/README.md	Fri Aug 15 13:14:31 2025 +0000
@@ -18,8 +18,7 @@
 - Troubleshooting connectivity issues in Galaxy deployments on Kubernetes
 - Debugging firewall rules, NFS export configurations, and CVMFS client setup
 - Comprehensive Network Diagnostics: DNS resolution, routing, and external connectivity
-- Custom VM Integration: Uses galaxy-k8s-boot-v2025-08-12 image with pre-configured CVMFS client
-
+- Custom VM Integration: Uses (e.g., `galaxy-k8s-boot-v2025-08-12`) image with pre-configured CVMFS client and NFS support
 
 The tool is available in the Main Tool Shed at:
 https://toolshed.g2.bx.psu.edu/view/enis/gcp_batch_netcat/
@@ -115,14 +114,6 @@
 - Downloaded JSON key file for the service account
 - Access to the custom VM image: e.g., `galaxy-k8s-boot-v2025-08-12`
 
-### Network Configuration
-- Firewall rule allowing traffic from the Batch subnet to NFS server:
-```
-gcloud compute firewall-rules create allow-nfs-from-batch \
-  --network=NETWORK_NAME \
-  --allow=tcp:2049
-```
-
 ### NFS Server Setup
 - The NFS service must be accessible via LoadBalancer with external IP (typically private within VPC)
 - NFS server should support NFSv4.2 with sec=sys security
--- a/log1.txt	Thu Aug 14 16:48:42 2025 +0000
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,29 +0,0 @@
-2025-08-14 14:20:55.181 WGST
-✓ Found: export directory
-2025-08-14 14:20:55.187 WGST
-total 4
-2025-08-14 14:20:55.187 WGST
-drwxr-xr-x 3 nobody nogroup 0 Aug 12 15:49 .
-2025-08-14 14:20:55.187 WGST
-drwxrwsrwx 14 nobody nogroup 4096 Aug 14 15:12 pvc-aa9a2d4e-2066-40ec-85de-8eb13c8cb9a5
-2025-08-14 14:20:55.188 WGST
-Looking for PVC directories in export...
-2025-08-14 14:21:06.333 WGST
-report agent state: metadata:{parent:"projects/526897014808/locations/us-east4" zone:"us-east4-b" instance:"netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0-6rs8" instance_id:3148160217266536671 creation_time:{seconds:1755184743 nanos:205126122} creator:"projects/526897014808/regions/us-east4/instanceGroupManagers/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0" version:"cloud-batch-agent_20250723.00_p00" os_release:{key:"ID" value:"ubuntu"} os_release:{key:"NAME" value:"Ubuntu"} os_release:{key:"VERSION" value:"24.04.3 LTS (Noble Numbat)"} os_release:{key:"VERSION_CODENAME" value:"noble"} os_release:{key:"VERSION_ID" value:"24.04"} machine_type:"e2-medium"} agent_info:{state:AGENT_RUNNING job_id:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" user_project_num:526897014808 tasks:{task_id:"action/STARTUP/0/0/group0" task_status:{state:SUCCEEDED status_events:{type:"ASSIGNED" description:"task action/STARTUP/0/0/group0 ASSIGNED" event_time:{seconds:1755184743 nanos:486748362} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task action/STARTUP/0/0/group0 RUNNING" event_time:{seconds:1755184743 nanos:486756403} task_state:RUNNING} status_events:{type:"SUCCEEDED" description:"succeeded" event_time:{seconds:1755184743 nanos:953481712} task_state:SUCCEEDED}}} tasks:{task_id:"task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0" task_status:{state:RUNNING status_events:{type:"ASSIGNED" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 ASSIGNED" event_time:{seconds:1755184746 nanos:423966399} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 RUNNING" event_time:{seconds:1755184746 nanos:423969481} task_state:RUNNING}}} report_time:{seconds:1755184866 nanos:332761176} task_group_id:"group0"} agent_timing_info:{boot_time:{seconds:1755184712 nanos:999352912} script_startup_time:{seconds:1755184734 nanos:429352912} agent_startup_time:{seconds:1755184743 nanos:205126122}}
-2025-08-14 14:21:06.416 WGST
-Server response for instance 3148160217266536671: tasks:{task:"action/STARTUP/0/0/group0" status:{state:SUCCEEDED status_events:{type:"ASSIGNED" description:"task action/STARTUP/0/0/group0 ASSIGNED" event_time:{seconds:1755184743 nanos:486748362} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task action/STARTUP/0/0/group0 RUNNING" event_time:{seconds:1755184743 nanos:486756403} task_state:RUNNING} status_events:{type:"SUCCEEDED" description:"succeeded" event_time:{seconds:1755184743 nanos:953481712} task_state:SUCCEEDED}} intended_state:ASSIGNED job_uid:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" task_group_id:"group0" location:"us-east4" job_id:"netcat-job-9b31e9b3-b4ac-4a1c-8eb6-ed0104b17750"} tasks:{task:"task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0" status:{state:RUNNING status_events:{type:"ASSIGNED" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 ASSIGNED" event_time:{seconds:1755184746 nanos:423966399} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 RUNNING" event_time:{seconds:1755184746 nanos:423969481} task_state:RUNNING}} intended_state:ASSIGNED job_uid:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" task_group_id:"group0" location:"us-east4" job_id:"netcat-job-9b31e9b3-b4ac-4a1c-8eb6-ed0104b17750"} use_batch_monitored_resource:true.
-2025-08-14 14:21:15.570 WGST
-/mnt/nfs/export/pvc-aa9a2d4e-2066-40ec-85de-8eb13c8cb9a5
-2025-08-14 14:21:15.570 WGST
-2025-08-14 14:21:15.570 WGST
-=== Looking for Galaxy directories ===
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database/files
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database/objects
-2025-08-14 14:21:15.576 WGST
-✗ Not found: tools
-2025-08-14 14:21:15.576 WGST
-✗ Not found: shed_tools
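
The connectivity check this tool is built around is, at its core, a timed TCP connect (the `nc -z`-style probe against `8.8.8.8:53` or an NFS endpoint on port 2049). A minimal Python sketch of that probe is below; the function name and structure are illustrative and are not the tool's actual source:

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection, mirroring `nc -z -w3 <host> <port>`.

    Returns True if the three-way handshake completes within the
    timeout, False on refusal, timeout, or any other socket error.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # The guide's external baseline test: Google public DNS on TCP 53.
    # Result depends on the network the script runs from.
    print("8.8.8.8:53 reachable:", check_tcp("8.8.8.8", 53))
```

A success here only proves L4 reachability from the Batch worker; it says nothing about NFS export permissions or mount options, which is why the guide pairs it with actual mount attempts.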