Mercurial > repos > enis > gcp_batch_netcat
changeset 13:be2f70ae6749 draft default tip
planemo upload for repository https://github.com/afgane/gcp_batch_netcat commit 2435de746d841f314b70f6257de0a3abaf77ec90
author:   enis
date:     Fri, 15 Aug 2025 13:14:31 +0000
parents:  56543de39954
children: (none)
files:    DEBUGGING_GUIDE.md README.md log1.txt
diffstat: 3 files changed, 1 insertions(+), 342 deletions(-)
--- a/DEBUGGING_GUIDE.md	Thu Aug 14 16:48:42 2025 +0000
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,303 +0,0 @@
-# GCP Batch - Kubernetes Connectivity Debugging Guide
-
-## Analysis of Your Test Results
-
-Based on your Google DNS test output, here's what we learned:
-
-### ✅ What's Working
-- **External connectivity is functional**: GCP Batch can reach 8.8.8.8:53
-- **Basic networking is operational**: The Batch worker has internet access
-- **DNS resolution works**: The container can resolve external addresses
-
-### ❌ What's Not Working
-- **Kubernetes API unreachable**: `kubernetes.default.svc.cluster.local:443` failed
-- **Container tooling limited**: `ip` command not available in the container
-
-### 🔍 Key Insight
-This confirms the **core networking issue**: GCP Batch workers (external VMs) cannot reach Kubernetes cluster-internal services, even when in the same VPC. This is **expected behavior** - Kubernetes services are not accessible from outside the cluster by default.
-
-## Immediate Action Required
-
-Since Kubernetes services aren't accessible from GCP Batch, you need to expose your NFS service externally. Here are your options:
-
-### 🚀 Quick Fix: NodePort Service (Recommended for Testing)
-
-This is the fastest way to test connectivity. Create a NodePort service that exposes your NFS server:
-
-```bash
-# First, find your current NFS service
-kubectl get svc | grep -i nfs
-
-# Create a NodePort service (replace with your actual NFS service details)
-kubectl create service nodeport nfs-ganesha-external \
-  --tcp=2049:2049 \
-  --node-port=32049
-
-# Or apply this YAML:
-cat <<EOF | kubectl apply -f -
-apiVersion: v1
-kind: Service
-metadata:
-  name: nfs-ganesha-external
-spec:
-  type: NodePort
-  ports:
-  - port: 2049
-    targetPort: 2049
-    nodePort: 32049
-  selector:
-    # Replace with your actual NFS pod labels
-    app: nfs-ganesha
-EOF
-```
-
-Then test with your tool using a GKE node IP and port 32049:
-
-```bash
-# Get a node IP
-kubectl get nodes -o wide
-
-# Test connectivity to <node-ip>:32049
-```
-
-### 🎯 Production Fix: LoadBalancer with Firewall Rules
-
-For production, use a LoadBalancer service with proper firewall configuration:
-
-```bash
-# Create LoadBalancer service
-cat <<EOF | kubectl apply -f -
-apiVersion: v1
-kind: Service
-metadata:
-  name: nfs-ganesha-lb
-spec:
-  type: LoadBalancer
-  ports:
-  - port: 2049
-    targetPort: 2049
-  selector:
-    # Replace with your actual NFS pod labels
-    app: nfs-ganesha
-EOF
-
-# Wait for external IP assignment
-kubectl get svc nfs-ganesha-lb -w
-
-# Create firewall rule allowing GCP Batch to access NFS
-gcloud compute firewall-rules create allow-nfs-from-batch \
-  --allow tcp:2049 \
-  --source-ranges 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
-  --description "Allow NFS access from GCP Batch workers"
-```
-
-### 📋 Next Steps
-
-1. **Implement NodePort solution** for immediate testing
-2. **Test connectivity** using your enhanced debugging tool with `test_type=custom`
-3. **If NodePort works**, move to LoadBalancer for production use
-4. **Update Galaxy configuration** to use the new NFS endpoint
-
-### 💡 Why This Happens
-
-Your test results confirm what we suspected: GCP Batch workers are essentially external VMs that cannot access Kubernetes ClusterIP services. This is standard Kubernetes behavior - internal services are isolated from external networks for security.
-
-## The Core Problem
-
-You're experiencing a classic networking issue where GCP Batch workers (running outside your Kubernetes cluster) cannot reach services inside the cluster, even when they're in the same VPC/subnet. This is because:
-
-1. **GCP Batch runs on Compute Engine VMs** outside your GKE cluster
-2. **Kubernetes services** (like NFS ClusterIP services) are only accessible from within the cluster by default
-3. **LoadBalancer services** should work, but there might be firewall rules blocking traffic
-
-## 🔍 Quick Diagnostic Commands
-
-Run these commands to understand your current setup before making changes:
-
-```bash
-# 1. Find your current NFS-related services and pods
-kubectl get svc,pods | grep -i -E "(nfs|ganesha|storage)"
-
-# 2. Check what's actually running
-kubectl get pods -o wide | grep -i nfs
-
-# 3. Look at your current service configuration
-kubectl get svc -o wide | grep -i nfs
-
-# 4. Check if you have any existing LoadBalancer services
-kubectl get svc --field-selector spec.type=LoadBalancer
-
-# 5. Get node IPs for potential NodePort testing
-kubectl get nodes -o wide --no-headers | awk '{print $1 "\t" $7}'
-```
-
-## 🎯 Your Specific Issue Summary
-
-Based on your test output:
-- ✅ **GCP Batch networking works** (can reach 8.8.8.8:53)
-- ❌ **Cannot reach Kubernetes services** (kubernetes.default.svc.cluster.local:443 failed)
-- 📍 **Root cause**: NFS service is likely ClusterIP type (internal only)
-- 🔧 **Solution**: Expose NFS externally via NodePort or LoadBalancer
-
-## Debugging Steps
-
-### 1. Test External Connectivity First
-Use the enhanced tool with `test_type=google_dns` to verify basic connectivity works:
-```
-Test Type: Google DNS - External Test
-```
-This should succeed and confirms GCP Batch networking is working.
-
-### 2. Check Your NFS Service Type and Configuration
-
-Run these commands to examine your current NFS setup:
-
-```bash
-# Check NFS-related services
-kubectl get svc | grep -i nfs
-kubectl get svc | grep -i ganesha
-
-# Get detailed service info
-kubectl describe svc <your-nfs-service-name>
-
-# Check endpoints
-kubectl get endpoints | grep -i nfs
-```
-
-### 3. Common Solutions
-
-#### Option A: Use NodePort Service
-NodePort services are accessible from external networks:
-
-```yaml
-apiVersion: v1
-kind: Service
-metadata:
-  name: nfs-ganesha-nodeport
-spec:
-  type: NodePort
-  ports:
-  - port: 2049
-    targetPort: 2049
-    nodePort: 32049  # or let K8s assign
-  selector:
-    app: nfs-ganesha
-```
-
-Then test with the node IP:port (e.g., `<node-ip>:32049`)
-
-#### Option B: LoadBalancer with Correct Firewall Rules
-Ensure your LoadBalancer service has proper firewall rules:
-
-```bash
-# Check your LoadBalancer service
-kubectl get svc <nfs-service-name> -o yaml
-
-# Create firewall rule if needed
-gcloud compute firewall-rules create allow-nfs-from-batch \
-  --allow tcp:2049 \
-  --source-ranges 10.0.0.0/8 \
-  --target-tags gke-<cluster-name>-node
-```
-
-#### Option C: Use Cloud Filestore
-For production Galaxy deployments, consider using Google Cloud Filestore instead of in-cluster NFS:
-
-```bash
-# Create Filestore instance
-gcloud filestore instances create galaxy-filestore \
-  --tier=STANDARD \
-  --file-share=name="galaxy",capacity=1TB \
-  --network=name="<your-vpc>" \
-  --zone=<your-zone>
-```
-
-### 4. Network Debugging Commands
-
-Run these on a GKE node to understand the network setup:
-
-```bash
-# Get node info
-kubectl get nodes -o wide
-
-# Check what's running on nodes
-kubectl get pods -o wide | grep nfs
-
-# Test from a pod inside the cluster
-kubectl run debug-pod --image=nicolaka/netshoot -it --rm -- /bin/bash
-# Then inside the pod:
-nc -zv <nfs-service-ip> 2049
-```
-
-### 5. Advanced Debugging with Enhanced Tool
-
-Use the enhanced tool I created to test different scenarios:
-
-1. **Test Galaxy Web Service**: `test_type=galaxy_web`
-   - This will try to find and test your Galaxy web service
-   - If this fails too, it's a broader networking issue
-
-2. **Test Custom Endpoints**: `test_type=custom`
-   - Test specific IPs you know should work
-   - Try testing a GKE node IP directly
-
-3. **Check Kubernetes DNS**: `test_type=k8s_dns`
-   - This tests if Batch workers can reach Kubernetes cluster services
-
-## 🛠️ Enhanced Container Tools
-
-The updated Docker container (`afgane/gcp-batch-netcat:0.2.0`) now includes comprehensive networking tools:
-
-### Core Network Tools
-- `ip` - Advanced IP routing and network device configuration
-- `ping` - Basic connectivity testing
-- `nslookup`/`dig` - DNS resolution testing
-- `curl`/`wget` - HTTP/HTTPS testing
-- `telnet` - Port connectivity testing
-- `traceroute` - Network path tracing
-- `netstat` - Network connection status
-- `ss` - Socket statistics
-- `tcpdump` - Network packet capture
-- `nmap` - Network scanning and port discovery
-
-### Enhanced Test Script
-
-With these tools, the container can now provide much more detailed debugging information:
-
-```bash
-# Network interface details
-ip addr show
-ip route show
-
-# DNS resolution testing
-nslookup target-host
-dig target-host
-
-# Port scanning
-nmap -p 2049 target-host
-
-# HTTP/HTTPS testing (for web services)
-curl -v http://target-host:port
-
-# Network path tracing
-traceroute target-host
-```
-
-## Root Cause Analysis
-
-Based on your description, the most likely issues are:
-
-1. **ClusterIP services** are not accessible from outside the cluster (expected behavior)
-2. **LoadBalancer services** might have firewall rules blocking GCP Batch source IPs
-3. **Network policies** in your cluster might be blocking external traffic
-4. **GKE cluster** might be using a different subnet than GCP Batch workers
-
-## Recommended Solution
-
-For Galaxy on GKE with GCP Batch integration, I recommend:
-
-1. **Use Google Cloud Filestore** for shared storage (most reliable)
-2. **If using in-cluster NFS**, expose it via NodePort or LoadBalancer with proper firewall rules
-3. **Test with the enhanced debugging tool** to get detailed network information
-
-Would you like me to help you implement any of these solutions or analyze the output from the enhanced debugging tool?
--- a/README.md	Thu Aug 14 16:48:42 2025 +0000
+++ b/README.md	Fri Aug 15 13:14:31 2025 +0000
@@ -18,8 +18,7 @@
 - Troubleshooting connectivity issues in Galaxy deployments on Kubernetes
 - Debugging firewall rules, NFS export configurations, and CVMFS client setup
 - Comprehensive Network Diagnostics: DNS resolution, routing, and external connectivity
-- Custom VM Integration: Uses galaxy-k8s-boot-v2025-08-12 image with pre-configured CVMFS client
-
+- Custom VM Integration: Uses (e.g., `galaxy-k8s-boot-v2025-08-12`) image with pre-configured CVMFS client and NFS support
 
 The tool is available in the Main Tool Shed at:
 https://toolshed.g2.bx.psu.edu/view/enis/gcp_batch_netcat/
@@ -115,14 +114,6 @@
 - Downloaded JSON key file for the service account
 - Access to the custom VM image: e.g., `galaxy-k8s-boot-v2025-08-12`
 
-### Network Configuration
-- Firewall rule allowing traffic from the Batch subnet to NFS server:
-```
-gcloud compute firewall-rules create allow-nfs-from-batch \
-  --network=NETWORK_NAME \
-  --allow=tcp:2049
-```
-
 ### NFS Server Setup
 - The NFS service must be accessible via LoadBalancer with external IP (typically private within VPC)
 - NFS server should support NFSv4.2 with sec=sys security
--- a/log1.txt	Thu Aug 14 16:48:42 2025 +0000
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,29 +0,0 @@
-2025-08-14 14:20:55.181 WGST
-✓ Found: export directory
-2025-08-14 14:20:55.187 WGST
-total 4
-2025-08-14 14:20:55.187 WGST
-drwxr-xr-x 3 nobody nogroup 0 Aug 12 15:49 .
-2025-08-14 14:20:55.187 WGST
-drwxrwsrwx 14 nobody nogroup 4096 Aug 14 15:12 pvc-aa9a2d4e-2066-40ec-85de-8eb13c8cb9a5
-2025-08-14 14:20:55.188 WGST
-Looking for PVC directories in export...
-2025-08-14 14:21:06.333 WGST
-report agent state: metadata:{parent:"projects/526897014808/locations/us-east4" zone:"us-east4-b" instance:"netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0-6rs8" instance_id:3148160217266536671 creation_time:{seconds:1755184743 nanos:205126122} creator:"projects/526897014808/regions/us-east4/instanceGroupManagers/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0" version:"cloud-batch-agent_20250723.00_p00" os_release:{key:"ID" value:"ubuntu"} os_release:{key:"NAME" value:"Ubuntu"} os_release:{key:"VERSION" value:"24.04.3 LTS (Noble Numbat)"} os_release:{key:"VERSION_CODENAME" value:"noble"} os_release:{key:"VERSION_ID" value:"24.04"} machine_type:"e2-medium"} agent_info:{state:AGENT_RUNNING job_id:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" user_project_num:526897014808 tasks:{task_id:"action/STARTUP/0/0/group0" task_status:{state:SUCCEEDED status_events:{type:"ASSIGNED" description:"task action/STARTUP/0/0/group0 ASSIGNED" event_time:{seconds:1755184743 nanos:486748362} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task action/STARTUP/0/0/group0 RUNNING" event_time:{seconds:1755184743 nanos:486756403} task_state:RUNNING} status_events:{type:"SUCCEEDED" description:"succeeded" event_time:{seconds:1755184743 nanos:953481712} task_state:SUCCEEDED}}} tasks:{task_id:"task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0" task_status:{state:RUNNING status_events:{type:"ASSIGNED" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 ASSIGNED" event_time:{seconds:1755184746 nanos:423966399} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 RUNNING" event_time:{seconds:1755184746 nanos:423969481} task_state:RUNNING}}} report_time:{seconds:1755184866 nanos:332761176} task_group_id:"group0"} agent_timing_info:{boot_time:{seconds:1755184712 nanos:999352912} script_startup_time:{seconds:1755184734 nanos:429352912} agent_startup_time:{seconds:1755184743 nanos:205126122}}
-2025-08-14 14:21:06.416 WGST
-Server response for instance 3148160217266536671: tasks:{task:"action/STARTUP/0/0/group0" status:{state:SUCCEEDED status_events:{type:"ASSIGNED" description:"task action/STARTUP/0/0/group0 ASSIGNED" event_time:{seconds:1755184743 nanos:486748362} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task action/STARTUP/0/0/group0 RUNNING" event_time:{seconds:1755184743 nanos:486756403} task_state:RUNNING} status_events:{type:"SUCCEEDED" description:"succeeded" event_time:{seconds:1755184743 nanos:953481712} task_state:SUCCEEDED}} intended_state:ASSIGNED job_uid:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" task_group_id:"group0" location:"us-east4" job_id:"netcat-job-9b31e9b3-b4ac-4a1c-8eb6-ed0104b17750"} tasks:{task:"task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0" status:{state:RUNNING status_events:{type:"ASSIGNED" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 ASSIGNED" event_time:{seconds:1755184746 nanos:423966399} task_state:ASSIGNED} status_events:{type:"RUNNING" description:"task task/netcat-job-9b31e9b-e30f48bf-ed6a-41470-group0-0/0/0 RUNNING" event_time:{seconds:1755184746 nanos:423969481} task_state:RUNNING}} intended_state:ASSIGNED job_uid:"netcat-job-9b31e9b-e30f48bf-ed6a-41470" task_group_id:"group0" location:"us-east4" job_id:"netcat-job-9b31e9b3-b4ac-4a1c-8eb6-ed0104b17750"} use_batch_monitored_resource:true.
-2025-08-14 14:21:15.570 WGST
-/mnt/nfs/export/pvc-aa9a2d4e-2066-40ec-85de-8eb13c8cb9a5
-2025-08-14 14:21:15.570 WGST
-2025-08-14 14:21:15.570 WGST
-=== Looking for Galaxy directories ===
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database/files
-2025-08-14 14:21:15.571 WGST
-✗ Not found: database/objects
-2025-08-14 14:21:15.576 WGST
-✗ Not found: tools
-2025-08-14 14:21:15.576 WGST
-✗ Not found: shed_tools
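
The connectivity check this tool is built around is, at its core, a timed TCP connect (the `nc -z`-style probe against `8.8.8.8:53` or an NFS endpoint on port 2049). A minimal Python sketch of that probe is below; the function name and structure are illustrative and are not the tool's actual source:

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection, mirroring `nc -z -w3 <host> <port>`.

    Returns True if the three-way handshake completes within the
    timeout, False on refusal, timeout, or any other socket error.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # The guide's external baseline test: Google public DNS on TCP 53.
    # Result depends on the network the script runs from.
    print("8.8.8.8:53 reachable:", check_tcp("8.8.8.8", 53))
```

A success here only proves L4 reachability from the Batch worker; it says nothing about NFS export permissions or mount options, which is why the guide pairs it with actual mount attempts.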