comparison COBRAxy/docs/tools/marea-cluster.md @ 492:4ed95023af20 draft
| author | francesco_lapi |
|---|---|
| date | Tue, 30 Sep 2025 14:02:17 +0000 |
| parents | |
| children | fcdbc81feb45 |
| 491:7a413a5ec566 | 492:4ed95023af20 |
|---|---|
| 1 # MAREA Cluster | |
| 2 | |
| 3 Perform clustering analysis on metabolic data to identify sample groups and patterns. | |
| 4 | |
| 5 ## Overview | |
| 6 | |
| 7 MAREA Cluster performs unsupervised clustering analysis on RAS, RPS, or flux data to identify natural groupings among samples. It supports multiple clustering algorithms (K-means, DBSCAN, Hierarchical) with optional data scaling and validation metrics including elbow plots and silhouette analysis. | |
| 8 | |
| 9 ## Usage | |
| 10 | |
| 11 ### Command Line | |
| 12 | |
| 13 ```bash | |
| 14 marea_cluster -td /path/to/COBRAxy \ | |
| 15 -in metabolic_data.tsv \ | |
| 16 -cy kmeans \ | |
| 17 -sc true \ | |
| 18 -k1 2 \ | |
| 19 -k2 8 \ | |
| 20 -el true \ | |
| 21 -si true \ | |
| 22 -idop clustering_results/ \ | |
| 23 -ol cluster.log | |
| 24 ``` | |
| 25 | |
| 26 ### Galaxy Interface | |
| 27 | |
| 28 Select "MAREA Cluster" from the COBRAxy tool suite and configure clustering parameters through the web interface. | |
| 29 | |
| 30 ## Parameters | |
| 31 | |
| 32 ### Required Parameters | |
| 33 | |
| 34 | Parameter | Flag | Description | | |
| 35 |-----------|------|-------------| | |
| 36 | Tool Directory | `-td, --tool_dir` | Path to COBRAxy installation directory | | |
| 37 | Input Data | `-in, --input` | Metabolic data file (TSV format) | | |
| 38 | |
| 39 ### Clustering Parameters | |
| 40 | |
| 41 | Parameter | Flag | Description | Default | | |
| 42 |-----------|------|-------------|---------| | |
| 43 | Cluster Type | `-cy, --cluster_type` | Clustering algorithm | kmeans | | |
| 44 | Data Scaling | `-sc, --scaling` | Apply data normalization | true | | |
| 45 | Minimum K | `-k1, --k_min` | Minimum number of clusters | 2 | | |
| 46 | Maximum K | `-k2, --k_max` | Maximum number of clusters | 7 | | |
| 47 | |
| 48 ### Analysis Options | |
| 49 | |
| 50 | Parameter | Flag | Description | Default | | |
| 51 |-----------|------|-------------|---------| | |
| 52 | Elbow Plot | `-el, --elbow` | Generate elbow plot for K-means | false | | |
| 53 | Silhouette Analysis | `-si, --silhouette` | Generate silhouette plots | false | | |
| 54 | |
| 55 ### DBSCAN Specific Parameters | |
| 56 | |
| 57 | Parameter | Flag | Description | Default | | |
| 58 |-----------|------|-------------|---------| | |
| 59 | Min Samples | `-ms, --min_samples` | Minimum number of neighbouring samples required for a point to count as a core point | - | |
| 60 | Epsilon | `-ep, --eps` | Maximum distance between two samples for them to be considered neighbours | - | |
| 61 | |
| 62 ### Output Parameters | |
| 63 | |
| 64 | Parameter | Flag | Description | Default | | |
| 65 |-----------|------|-------------|---------| | |
| 66 | Output Path | `-idop, --output_path` | Results directory | clustering/ | | |
| 67 | Output Log | `-ol, --out_log` | Log file path | - | | |
| 68 | Best Cluster | `-bc, --best_cluster` | Best clustering result file | - | | |
| 69 | |
| 70 ## Clustering Algorithms | |
| 71 | |
| 72 ### K-means | |
| 73 **Method**: Partitional clustering using centroids | |
| 74 - Assumes spherical clusters | |
| 75 - Requires pre-specified number of clusters (k) | |
| 76 - Fast and scalable | |
| 77 - Works well with normalized data | |
| 78 | |
| 79 **Best for**: | |
| 80 - Well-separated, compact clusters | |
| 81 - Large datasets | |
| 82 - When cluster number is approximately known | |
| 83 | |
| 84 ### DBSCAN | |
| 85 **Method**: Density-based clustering | |
| 86 - Identifies clusters of varying shapes | |
| 87 - Automatically determines cluster number | |
| 88 - Robust to outliers and noise | |
| 89 - Requires epsilon and min_samples parameters | |
| 90 | |
| 91 **Best for**: | |
| 92 - Irregular cluster shapes | |
| 93 - Datasets with noise/outliers | |
| 94 - Unknown number of clusters | |
| 95 | |
| 96 ### Hierarchical | |
| 97 **Method**: Agglomerative clustering with dendrograms | |
| 98 - Creates tree-like cluster hierarchy | |
| 99 - No need to specify cluster number initially | |
| 100 - Deterministic results | |
| 101 - Provides multiple resolution levels | |
| 102 | |
| 103 **Best for**: | |
| 104 - Small to medium datasets | |
| 105 - When cluster hierarchy is important | |
| 106 - Exploratory analysis | |
| 107 | |
| 108 ## Input Format | |
| 109 | |
| 110 ### Metabolic Data File | |
| 111 | |
| 112 Tab-separated format with samples as rows and reactions/metabolites as columns: | |
| 113 | |
| 114 ``` | |
| 115 Sample R00001 R00002 R00003 R00004 ... | |
| 116 Sample1 1.25 0.85 1.42 0.78 ... | |
| 117 Sample2 0.65 1.35 0.72 1.28 ... | |
| 118 Sample3 2.15 2.05 0.45 0.52 ... | |
| 119 Control1 1.05 0.98 1.15 1.08 ... | |
| 120 Control2 0.95 1.12 0.88 0.92 ... | |
| 121 ``` | |
| 122 | |
| 123 **Requirements**: | |
| 124 - First column: sample identifiers | |
| 125 - Subsequent columns: feature values (RAS, RPS, fluxes) | |
| 126 - Missing values: use 0 or leave empty | |
| 127 - Numeric data only (excluding sample names) | |
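
The layout described above can be checked before running the tool. The following is a minimal sketch, not the tool's own parser: `read_feature_table` is a hypothetical helper, and reading an empty cell as `0.0` is an assumption based on the "use 0 or leave empty" requirement.

```python
import csv
import io


def read_feature_table(handle):
    """Return (sample_ids, feature_names, rows) from a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    feature_names = header[1:]  # first column holds the sample identifiers
    sample_ids, rows = [], []
    for record in reader:
        if not record:
            continue  # skip blank lines
        sample_ids.append(record[0])
        # Assumption: an empty cell is read as 0.0
        values = [float(cell) if cell.strip() else 0.0 for cell in record[1:]]
        if len(values) != len(feature_names):
            raise ValueError(
                f"{record[0]}: expected {len(feature_names)} values, got {len(values)}"
            )
        rows.append(values)
    return sample_ids, feature_names, rows


# Tiny in-memory example; the second sample has one empty cell.
example = "Sample\tR00001\tR00002\nSample1\t1.25\t0.85\nSample2\t0.65\t\n"
ids, features, matrix = read_feature_table(io.StringIO(example))
print(ids, features, matrix)
```

A non-numeric cell raises `ValueError` from `float`, which surfaces violations of the "numeric data only" requirement early.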
| 128 | |
| 129 ## Data Preprocessing | |
| 130 | |
| 131 ### Scaling Options | |
| 132 | |
| 133 #### Standard Scaling (Recommended) | |
| 134 - Mean centering and unit variance scaling | |
| 135 - Formula: `(x - mean) / std` | |
| 136 - Ensures equal feature contribution | |
| 137 - Strongly recommended for distance-based algorithms such as K-means and DBSCAN | |
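
The formula `(x - mean) / std` can be sketched per feature column as follows. This is an illustration only: `standard_scale` is a hypothetical helper, and the use of the population standard deviation is an assumption (the text above does not say which variant the tool applies).

```python
from statistics import fmean, pstdev


def standard_scale(rows):
    """Scale each column to zero mean and unit variance (population std)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    scaled = []
    for row in rows:
        # A constant column (std == 0) is mapped to 0.0 to avoid dividing by zero.
        scaled.append(
            [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        )
    return scaled


# Two features on very different scales end up directly comparable.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standard_scale(rows)
print(scaled)
```

After scaling, both columns hold the same values, so neither feature dominates a Euclidean distance.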
| 138 | |
| 139 #### No Scaling | |
| 140 - Use original data values | |
| 141 - May be appropriate for already normalized data | |
| 142 - Risk of feature dominance by high-magnitude variables | |
| 143 | |
| 144 ### Feature Selection | |
| 145 | |
| 146 Consider preprocessing steps: | |
| 147 - Remove low-variance features | |
| 148 - Apply dimensionality reduction (PCA) | |
| 149 - Select most variable reactions/metabolites | |
| 150 - Handle missing data appropriately | |
| 151 | |
| 152 ## Output Files | |
| 153 | |
| 154 ### Cluster Assignments | |
| 155 | |
| 156 #### Best Clustering Result (`best_clusters.tsv`) | |
| 157 ``` | |
| 158 Sample Cluster Silhouette_Score | |
| 159 Sample1 1 0.73 | |
| 160 Sample2 1 0.68 | |
| 161 Sample3 2 0.81 | |
| 162 Control1 0 0.59 | |
| 163 Control2 0 0.62 | |
| 164 ``` | |
| 165 | |
| 166 #### All K Results (`clustering_results_k{n}.tsv`) | |
| 167 Individual files for each tested cluster number. | |
| 168 | |
| 169 ### Validation Metrics | |
| 170 | |
| 171 #### Elbow Plot (`elbow_plot.png`) | |
| 172 - X-axis: Number of clusters (k) | |
| 173 - Y-axis: Within-cluster sum of squares (WCSS) | |
| 174 - Identifies optimal k at the "elbow" point | |
| 175 | |
| 176 #### Silhouette Plots (`silhouette_k{n}.png`) | |
| 177 - Individual sample silhouette scores | |
| 178 - Average silhouette width per cluster | |
| 179 - Overall clustering quality assessment | |
| 180 | |
| 181 ### Summary Statistics | |
| 182 | |
| 183 #### Clustering Summary (`clustering_summary.txt`) | |
| 184 ``` | |
| 185 Algorithm: kmeans | |
| 186 Scaling: true | |
| 187 Optimal K: 3 | |
| 188 Best Silhouette Score: 0.72 | |
| 189 Number of Samples: 20 | |
| 190 Feature Dimensions: 150 | |
| 191 ``` | |
| 192 | |
| 193 #### Cluster Characteristics (`cluster_stats.tsv`) | |
| 194 ``` | |
| 195 Cluster Size Centroid_R00001 Centroid_R00002 Avg_Silhouette | |
| 196 0 8 0.95 1.12 0.68 | |
| 197 1 7 1.35 0.82 0.74 | |
| 198 2 5 0.65 1.55 0.69 | |
| 199 ``` | |
| 200 | |
| 201 ## Examples | |
| 202 | |
| 203 ### Basic K-means Clustering | |
| 204 | |
| 205 ```bash | |
| 206 # Simple K-means with elbow analysis | |
| 207 marea_cluster -td /opt/COBRAxy \ | |
| 208 -in ras_data.tsv \ | |
| 209 -cy kmeans \ | |
| 210 -sc true \ | |
| 211 -k1 2 \ | |
| 212 -k2 10 \ | |
| 213 -el true \ | |
| 214 -si true \ | |
| 215 -idop kmeans_results/ \ | |
| 216 -ol kmeans.log | |
| 217 ``` | |
| 218 | |
| 219 ### DBSCAN Analysis | |
| 220 | |
| 221 ```bash | |
| 222 # Density-based clustering with custom parameters | |
| 223 marea_cluster -td /opt/COBRAxy \ | |
| 224 -in flux_samples.tsv \ | |
| 225 -cy dbscan \ | |
| 226 -sc true \ | |
| 227 -ms 5 \ | |
| 228 -ep 0.5 \ | |
| 229 -idop dbscan_results/ \ | |
| 230 -bc best_dbscan_clusters.tsv \ | |
| 231 -ol dbscan.log | |
| 232 ``` | |
| 233 | |
| 234 ### Hierarchical Clustering | |
| 235 | |
| 236 ```bash | |
| 237 # Hierarchical clustering for small dataset | |
| 238 marea_cluster -td /opt/COBRAxy \ | |
| 239 -in rps_scores.tsv \ | |
| 240 -cy hierarchy \ | |
| 241 -sc true \ | |
| 242 -k1 2 \ | |
| 243 -k2 6 \ | |
| 244 -si true \ | |
| 245 -idop hierarchical_results/ \ | |
| 246 -ol hierarchy.log | |
| 247 ``` | |
| 248 | |
| 249 ### Comprehensive Clustering Analysis | |
| 250 | |
| 251 ```bash | |
| 252 # Compare multiple algorithms | |
| 253 algorithms=("kmeans" "dbscan" "hierarchy") | |
| 254 for alg in "${algorithms[@]}"; do | |
| 255 marea_cluster -td /opt/COBRAxy \ | |
| 256 -in metabolomics_data.tsv \ | |
| 257 -cy "$alg" \ | |
| 258 -sc true \ | |
| 259 -k1 2 \ | |
| 260 -k2 8 \ | |
| 261 -el true \ | |
| 262 -si true \ | |
| 263 -idop "${alg}_clustering/" \ | |
| 264 -ol "${alg}_cluster.log" | |
| 265 done | |
| 266 ``` | |
| 267 | |
| 268 ## Parameter Optimization | |
| 269 | |
| 270 ### K-means Optimization | |
| 271 | |
| 272 #### Elbow Method | |
| 273 1. Run K-means for each k from k_min to k_max | |
| 274 2. Plot WCSS vs k | |
| 275 3. Identify "elbow" point where improvement diminishes | |
| 276 4. Select k at elbow as optimal | |
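
The steps above can be sketched with a toy implementation. This is not the tool's code: `kmeans` and `wcss` are hypothetical helpers, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, max_iter=100):
    """Minimal Lloyd's algorithm seeded with the first k points (illustrative only)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def wcss(points, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))


# Two compact groups: WCSS drops sharply from k=1 to k=2, then flattens.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
curve = {}
for k in range(1, 5):
    labels, centroids = kmeans(points, k)
    curve[k] = wcss(points, labels, centroids)
    print(k, round(curve[k], 3))
```

On this toy data the large drop between k=1 and k=2 marks the elbow. With such naive seeding a cluster can end up empty at larger k, which is one reason production implementations use better initialisation.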
| 277 | |
| 278 #### Silhouette Analysis | |
| 279 1. Compute silhouette scores for each k | |
| 280 2. Select k with highest average silhouette score | |
| 281 3. Validate with silhouette plots | |
| 282 4. Ensure clusters are well-separated | |
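
As a rough illustration of the score these steps rely on, the per-sample silhouette can be written as `(b - a) / max(a, b)`, where `a` is the mean distance to the sample's own cluster and `b` the mean distance to the nearest other cluster. The helper below is a hypothetical sketch using Euclidean distance, not the tool's implementation.

```python
import math


def silhouette_samples(points, labels):
    """Per-sample silhouette values using Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_samples(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_samples(points, [0, 1, 0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

The labelling that matches the two visible groups scores far higher on average than the interleaved one. Comparing that average across the candidate k values is the selection rule in step 2.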
| 283 | |
| 284 ### DBSCAN Parameter Tuning | |
| 285 | |
| 286 #### Epsilon (eps) Selection | |
| 287 - Use k-distance plot to identify knee point | |
| 288 - Start with eps = average distance to k-th nearest neighbor | |
| 289 - Adjust based on cluster quality metrics | |
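
The k-distance values behind such a plot can be computed as in this minimal sketch. `k_distances` is a hypothetical helper and the data are made up; it assumes Euclidean distance and does not draw the plot itself.

```python
import math


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Two dense groups plus one isolated point at (50, 50).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
kd = k_distances(points, k=2)
print([round(d, 2) for d in kd])
```

Plotted in this order, the values stay flat for points inside dense regions and jump for the isolated point. An `eps` just above the flat part keeps the dense groups together while leaving the isolated point as noise.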
| 290 | |
| 291 #### Min Samples Selection | |
| 292 - Rule of thumb: min_samples ≥ dimensionality + 1 | |
| 293 - Higher values create denser clusters | |
| 294 - Lower values may increase noise sensitivity | |
| 295 | |
| 296 ### Hierarchical Clustering | |
| 297 | |
| 298 #### Linkage Method | |
| 299 - Ward: Minimizes within-cluster variance | |
| 300 - Complete: Maximum distance between clusters | |
| 301 - Average: Mean distance between clusters | |
| 302 - Single: Minimum distance (prone to chaining) | |
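
The four criteria can be compared on a pair of small clusters. This is an illustrative sketch only: `linkage_distance` is a hypothetical helper, the Ward value is computed as the increase in within-cluster sum of squares caused by the merge, and the flag tables above do not list a linkage option, so nothing here describes how the tool selects one.

```python
import math


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under a given linkage criterion."""
    pairwise = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":
        return min(pairwise)  # closest pair
    if method == "complete":
        return max(pairwise)  # farthest pair
    if method == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    if method == "ward":
        # Increase in within-cluster sum of squares if the clusters are merged.
        na, nb = len(cluster_a), len(cluster_b)
        ca = [sum(col) / na for col in zip(*cluster_a)]
        cb = [sum(col) / nb for col in zip(*cluster_b)]
        return na * nb / (na + nb) * math.dist(ca, cb) ** 2
    raise ValueError(f"unknown linkage method: {method}")


a = [(0, 0), (0, 1)]
b = [(3, 0), (3, 1)]
distances = {m: linkage_distance(a, b, m) for m in ("single", "complete", "average", "ward")}
print(distances)
```

Single linkage only looks at the closest pair, which is why a chain of intermediate points can bridge otherwise distinct groups.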
| 303 | |
| 304 ## Quality Assessment | |
| 305 | |
| 306 ### Internal Validation Metrics | |
| 307 | |
| 308 #### Silhouette Score | |
| 309 - Range: [-1, 1] | |
| 310 - >0.7: Strong clustering | |
| 311 - 0.5-0.7: Reasonable clustering | |
| 312 - <0.5: Weak clustering | |
| 313 | |
| 314 #### Calinski-Harabasz Index | |
| 315 - Higher values indicate better clustering | |
| 316 - Ratio of between-cluster to within-cluster variance | |
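
The variance ratio can be made concrete with a short sketch. `calinski_harabasz` is a hypothetical helper that uses the usual normalisation, between-cluster dispersion divided by `k - 1` over within-cluster dispersion divided by `n - k`. Whether the tool reports this index is not stated here.

```python
import math


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each divided by its degrees of freedom."""
    n = len(points)
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of samples")
    overall = [sum(col) / n for col in zip(*points)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        between += len(members) * math.dist(centroid, overall) ** 2
        within += sum(math.dist(p, centroid) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ch = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])
print(round(ch, 3))  # approximately 450 for this toy data
```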
| 317 | |
| 318 #### Davies-Bouldin Index | |
| 319 - Lower values indicate better clustering | |
| 320 - Average similarity between each cluster and the cluster most similar to it | |
| 321 | |
| 322 ### External Validation | |
| 323 | |
| 324 When ground truth labels are available: | |
| 325 - Adjusted Rand Index (ARI) | |
| 326 - Normalized Mutual Information (NMI) | |
| 327 - Homogeneity and Completeness scores | |
| 328 | |
| 329 ## Biological Interpretation | |
| 330 | |
| 331 ### Cluster Characterization | |
| 332 | |
| 333 #### Metabolic Pathway Analysis | |
| 334 - Identify enriched pathways per cluster | |
| 335 - Compare metabolic profiles between clusters | |
| 336 - Relate clusters to biological conditions | |
| 337 | |
| 338 #### Sample Annotation | |
| 339 - Map clusters to experimental conditions | |
| 340 - Identify batch effects or confounders | |
| 341 - Validate with independent datasets | |
| 342 | |
| 343 #### Feature Importance | |
| 344 - Determine reactions/metabolites driving clustering | |
| 345 - Analyze cluster centroids for biological insights | |
| 346 - Connect to known metabolic phenotypes | |
| 347 | |
| 348 ## Integration Workflow | |
| 349 | |
| 350 ### Upstream Data Sources | |
| 351 | |
| 352 #### COBRAxy Tools | |
| 353 - [RAS Generator](ras-generator.md) - Cluster based on reaction activities | |
| 354 - [RPS Generator](rps-generator.md) - Cluster based on reaction propensities | |
| 355 - [Flux Simulation](flux-simulation.md) - Cluster flux distributions | |
| 356 | |
| 357 #### External Data | |
| 358 - Gene expression matrices | |
| 359 - Metabolomics datasets | |
| 360 - Clinical metadata | |
| 361 | |
| 362 ### Downstream Analysis | |
| 363 | |
| 364 #### Supervised Learning | |
| 365 Use cluster labels for: | |
| 366 - Classification model training | |
| 367 - Biomarker discovery | |
| 368 - Outcome prediction | |
| 369 | |
| 370 #### Differential Analysis | |
| 371 - Compare clusters with [MAREA](marea.md) | |
| 372 - Identify cluster-specific metabolic signatures | |
| 373 - Pathway enrichment analysis | |
| 374 | |
| 375 ### Typical Pipeline | |
| 376 | |
| 377 ```bash | |
| 378 # 1. Generate metabolic scores | |
| 379 ras_generator -td /opt/COBRAxy -in expression.tsv -ra ras.tsv | |
| 380 | |
| 381 # 2. Perform clustering analysis | |
| 382 marea_cluster -td /opt/COBRAxy -in ras.tsv -cy kmeans \ | |
| 383 -sc true -k1 2 -k2 8 -el true -si true \ | |
| 384 -idop clusters/ -bc best_clusters.tsv | |
| 385 | |
| 386 # 3. Analyze cluster differences | |
| 387 marea -td /opt/COBRAxy -input_data ras.tsv \ | |
| 388 -input_class best_clusters.tsv -comparison manyvsmany \ | |
| 389 -test ks -choice_map ENGRO2 -idop cluster_analysis/ | |
| 390 ``` | |
| 391 | |
| 392 ## Tips and Best Practices | |
| 393 | |
| 394 ### Data Preparation | |
| 395 - **Normalization**: Always scale features for distance-based methods | |
| 396 - **Dimensionality**: Consider PCA for high-dimensional data (>1000 features) | |
| 397 - **Missing Values**: Handle appropriately (imputation or removal) | |
| 398 - **Outliers**: Identify and consider removal for K-means | |
| 399 | |
| 400 ### Algorithm Selection | |
| 401 - **K-means**: Start here for most applications | |
| 402 - **DBSCAN**: Use when clusters have irregular shapes or noise is present | |
| 403 - **Hierarchical**: Choose for small datasets or when hierarchy matters | |
| 404 | |
| 405 ### Parameter Selection | |
| 406 - **Start Simple**: Begin with default parameters | |
| 407 - **Use Validation**: Always employ silhouette analysis | |
| 408 - **Cross-Validate**: Test stability across parameter ranges | |
| 409 - **Biological Validation**: Ensure clusters make biological sense | |
| 410 | |
| 411 ### Result Interpretation | |
| 412 - **Multiple Algorithms**: Compare results across methods | |
| 413 - **Stability Assessment**: Check clustering reproducibility | |
| 414 - **Biological Context**: Integrate with known sample characteristics | |
| 415 - **Statistical Testing**: Validate cluster differences formally | |
| 416 | |
| 417 ## Troubleshooting | |
| 418 | |
| 419 ### Common Issues | |
| 420 | |
| 421 **Poor clustering quality** | |
| 422 - Check data scaling and normalization | |
| 423 - Assess feature selection and dimensionality | |
| 424 - Try different algorithms or parameters | |
| 425 - Evaluate data structure with PCA/t-SNE | |
| 426 | |
| 427 **Algorithm doesn't converge** | |
| 428 - Increase iteration limits for K-means | |
| 429 - Adjust epsilon/min_samples for DBSCAN | |
| 430 - Check for numerical stability issues | |
| 431 - Verify input data format | |
| 432 | |
| 433 **Memory or performance issues** | |
| 434 - Reduce dataset size or dimensionality | |
| 435 - Use sampling for large datasets | |
| 436 - Consider approximate algorithms | |
| 437 - Monitor system resources | |
| 438 | |
| 439 ### Error Messages | |
| 440 | |
| 441 | Error | Cause | Solution | | |
| 442 |-------|-------|----------| | |
| 443 | "Convergence failed" | K-means iteration limit | Increase max iterations or check data | | |
| 444 | "No clusters found" | DBSCAN parameters too strict | Increase eps or reduce min_samples | |
| 445 | "Memory allocation error" | Dataset too large | Reduce size or increase memory | | |
| 446 | "Invalid silhouette score" | Single cluster found | Adjust parameters or algorithm | | |
| 447 | |
| 448 ### Performance Optimization | |
| 449 | |
| 450 **Large Datasets** | |
| 451 - Use mini-batch K-means for speed | |
| 452 - Sample data for parameter optimization | |
| 453 - Employ dimensionality reduction | |
| 454 - Consider distributed computing | |
| 455 | |
| 456 **High-Dimensional Data** | |
| 457 - Apply feature selection | |
| 458 - Use PCA preprocessing | |
| 459 - Consider specialized algorithms | |
| 460 - Validate results carefully | |
| 461 | |
| 462 ## Advanced Usage | |
| 463 | |
| 464 ### Custom Distance Metrics | |
| 465 | |
| 466 For specialized applications, modify distance calculations: | |
| 467 | |
| 468 ```python | |
| 469 # Custom distance function for metabolic data | |
| 470 def metabolic_distance(x, y): | |
| 471     # Placeholder: plain Euclidean distance; substitute a pathway-aware metric here | |
| 472     return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5 | |
| 473 ``` | |
| 474 | |
| 475 ### Ensemble Clustering | |
| 476 | |
| 477 Combine multiple clustering results: | |
| 478 | |
| 479 ```bash | |
| 480 # Run multiple algorithms and combine | |
| 481 for method in kmeans dbscan hierarchy; do | |
| 482 marea_cluster -td /opt/COBRAxy -cy "$method" -in data.tsv -idop "${method}_results/" | |
| 483 done | |
| 484 | |
| 485 # Consensus clustering (requires custom script) | |
| 486 python consensus_clustering.py -i *_results/best_clusters.tsv -o consensus.tsv | |
| 487 ``` | |
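
One common way to combine several labelings is a co-association matrix: for each pair of samples, the fraction of runs that placed them in the same cluster. The sketch below is hypothetical and is not the `consensus_clustering.py` script mentioned above. It assumes every labeling lists the samples in the same order and that noise points carry the label `-1`, as in scikit-learn's DBSCAN convention.

```python
def co_association(labelings, noise_label=-1):
    """Fraction of labelings in which each pair of samples shares a cluster."""
    n = len(labelings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("all labelings must cover the same samples")
        for i in range(n):
            if labels[i] == noise_label:
                continue  # assumption: noise points never count as co-clustered
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    return [[value / len(labelings) for value in row] for row in matrix]


# Three hypothetical runs over four samples; label values need not match across runs.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
consensus = co_association(runs)
for row in consensus:
    print([round(value, 2) for value in row])
```

Pairs with values near 1 are grouped together by most methods. The matrix can then be thresholded or clustered again to obtain consensus groups.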
| 488 | |
| 489 ### Interactive Analysis | |
| 490 | |
| 491 Generate interactive plots for exploration: | |
| 492 | |
| 493 ```python | |
| 494 import plotly.express as px | |
| 495 import pandas as pd | |
| 496 | |
| 497 # Load clustering results | |
| 498 results = pd.read_csv('best_clusters.tsv', sep='\t') | |
| 499 data = pd.read_csv('metabolic_data.tsv', sep='\t') | |
| 500 | |
| 501 # Interactive scatter plot (assumes 'PC1' and 'PC2' columns were added to data beforehand, e.g. via PCA) | |
| 502 fig = px.scatter(data, x='PC1', y='PC2', color=results['Cluster'].astype(str)) | |
| 503 fig.show() | |
| 504 ``` | |
| 505 | |
| 506 ## See Also | |
| 507 | |
| 508 - [MAREA](marea.md) - Statistical analysis of cluster differences | |
| 509 - [RAS Generator](ras-generator.md) - Generate clustering input data | |
| 510 - [Flux Simulation](flux-simulation.md) - Alternative clustering data source | |
| 511 - [Clustering Tutorial](../tutorials/clustering-analysis.md) | |
| 512 - [Validation Methods Reference](../tutorials/cluster-validation.md) |
