comparison COBRAxy/docs/tools/marea-cluster.md @ 492:4ed95023af20 draft

Uploaded
author francesco_lapi
date Tue, 30 Sep 2025 14:02:17 +0000
# MAREA Cluster

Perform clustering analysis on metabolic data to identify sample groups and patterns.

## Overview

MAREA Cluster performs unsupervised clustering analysis on RAS, RPS, or flux data to identify natural groupings among samples. It supports multiple clustering algorithms (K-means, DBSCAN, Hierarchical) with optional data scaling and validation metrics including elbow plots and silhouette analysis.

## Usage

### Command Line

```bash
marea_cluster -td /path/to/COBRAxy \
  -in metabolic_data.tsv \
  -cy kmeans \
  -sc true \
  -k1 2 \
  -k2 8 \
  -el true \
  -si true \
  -idop clustering_results/ \
  -ol cluster.log
```

### Galaxy Interface

Select "MAREA Cluster" from the COBRAxy tool suite and configure clustering parameters through the web interface.

## Parameters

### Required Parameters

| Parameter | Flag | Description |
|-----------|------|-------------|
| Tool Directory | `-td, --tool_dir` | Path to COBRAxy installation directory |
| Input Data | `-in, --input` | Metabolic data file (TSV format) |

### Clustering Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Cluster Type | `-cy, --cluster_type` | Clustering algorithm | kmeans |
| Data Scaling | `-sc, --scaling` | Apply data normalization | true |
| Minimum K | `-k1, --k_min` | Minimum number of clusters | 2 |
| Maximum K | `-k2, --k_max` | Maximum number of clusters | 7 |

### Analysis Options

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Elbow Plot | `-el, --elbow` | Generate elbow plot for K-means | false |
| Silhouette Analysis | `-si, --silhouette` | Generate silhouette plots | false |

### DBSCAN Specific Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Min Samples | `-ms, --min_samples` | Minimum number of neighboring samples for a point to count as a core point | - |
| Epsilon | `-ep, --eps` | Maximum distance between two samples for them to be considered neighbors | - |

### Output Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Output Path | `-idop, --output_path` | Results directory | clustering/ |
| Output Log | `-ol, --out_log` | Log file path | - |
| Best Cluster | `-bc, --best_cluster` | Best clustering result file | - |

## Clustering Algorithms

### K-means
**Method**: Partitional clustering using centroids
- Assumes spherical clusters
- Requires pre-specified number of clusters (k)
- Fast and scalable
- Works well with normalized data

**Best for**:
- Well-separated, compact clusters
- Large datasets
- When cluster number is approximately known

### DBSCAN
**Method**: Density-based clustering
- Identifies clusters of varying shapes
- Automatically determines cluster number
- Robust to outliers and noise
- Requires epsilon and min_samples parameters

**Best for**:
- Irregular cluster shapes
- Datasets with noise/outliers
- Unknown number of clusters

### Hierarchical
**Method**: Agglomerative clustering with dendrograms
- Creates tree-like cluster hierarchy
- No need to specify cluster number initially
- Deterministic results
- Provides multiple resolution levels

**Best for**:
- Small to medium datasets
- When cluster hierarchy is important
- Exploratory analysis

## Input Format

### Metabolic Data File

Tab-separated format with samples as rows and reactions/metabolites as columns:

```
Sample    R00001  R00002  R00003  R00004  ...
Sample1   1.25    0.85    1.42    0.78    ...
Sample2   0.65    1.35    0.72    1.28    ...
Sample3   2.15    2.05    0.45    0.52    ...
Control1  1.05    0.98    1.15    1.08    ...
Control2  0.95    1.12    0.88    0.92    ...
```

**Requirements**:
- First column: sample identifiers
- Subsequent columns: feature values (RAS, RPS, fluxes)
- Missing values: use 0 or leave empty
- Numeric data only (excluding sample names)
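
The requirements above can be checked before running the tool. The following is a minimal sketch using only the Python standard library; `check_matrix` is a hypothetical helper, not part of COBRAxy, and treating empty cells as `0.0` is an assumption made for illustration rather than documented tool behavior.

```python
import csv
import io

def check_matrix(tsv_text):
    """Parse a sample-by-feature TSV and return (samples, features, values).

    Empty cells are read as 0.0 (an assumption of this sketch); any other
    non-numeric cell raises ValueError.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    features = header[1:]
    samples, values = [], []
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        samples.append(row[0])
        # float() raises ValueError on any non-numeric cell
        values.append([float(cell) if cell.strip() else 0.0 for cell in row[1:]])
    return samples, features, values

example = "Sample\tR00001\tR00002\nSample1\t1.25\t0.85\nSample2\t\t1.35\n"
samples, features, values = check_matrix(example)
print(samples, features, values)
```

For a real file, the same function could be called on the text returned by `open(path).read()`.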

## Data Preprocessing

### Scaling Options

#### Standard Scaling (Recommended)
- Mean centering and unit variance scaling
- Formula: `(x - mean) / std`
- Ensures equal feature contribution
- Strongly recommended for distance-based algorithms such as K-means and DBSCAN
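
As an illustration of the formula above, here is a minimal sketch that scales a single feature column. The helper name `standard_scale` is hypothetical, and the use of the population standard deviation is an assumption (it matches scikit-learn's `StandardScaler`, but this page does not state which scaler the tool uses).

```python
import statistics

def standard_scale(column):
    """Scale one feature column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumed)
    if std == 0:
        # Constant feature: every value equals the mean, so return zeros
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

# Values of one reaction across five samples (taken from the input example)
scaled = standard_scale([1.25, 0.65, 2.15, 1.05, 0.95])
print([round(v, 3) for v in scaled])
```

After scaling, every feature has the same spread, so a reaction measured on a large numeric range no longer dominates the distance between samples.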

#### No Scaling
- Use original data values
- May be appropriate for already normalized data
- Risk of feature dominance by high-magnitude variables

### Feature Selection

Consider preprocessing steps:
- Remove low-variance features
- Apply dimensionality reduction (PCA)
- Select most variable reactions/metabolites
- Handle missing data appropriately
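
The first step in the list above, removing low-variance features, could look like the following sketch. `drop_low_variance` and its `threshold` default are hypothetical choices for illustration; a suitable threshold depends on the data and on whether it has already been scaled.

```python
import statistics

def drop_low_variance(features, threshold=1e-3):
    """Keep only features whose population variance exceeds the threshold.

    `features` maps a feature name to its list of values across samples.
    """
    return {
        name: column
        for name, column in features.items()
        if statistics.pvariance(column) > threshold
    }

features = {
    "R00001": [1.25, 0.65, 2.15],  # varies across samples: kept
    "R00002": [1.00, 1.00, 1.00],  # constant: carries no clustering signal
}
kept = drop_low_variance(features)
print(list(kept))
```

Filtering should be applied to the unscaled data, since standard scaling gives every non-constant feature unit variance.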

## Output Files

### Cluster Assignments

#### Best Clustering Result (`best_clusters.tsv`)
```
Sample    Cluster  Silhouette_Score
Sample1   1        0.73
Sample2   1        0.68
Sample3   2        0.81
Control1  0        0.59
Control2  0        0.62
```

#### All K Results (`clustering_results_k{n}.tsv`)
Individual files for each tested cluster number.

### Validation Metrics

#### Elbow Plot (`elbow_plot.png`)
- X-axis: Number of clusters (k)
- Y-axis: Within-cluster sum of squares (WCSS)
- Identifies optimal k at the "elbow" point

#### Silhouette Plots (`silhouette_k{n}.png`)
- Individual sample silhouette scores
- Average silhouette width per cluster
- Overall clustering quality assessment

### Summary Statistics

#### Clustering Summary (`clustering_summary.txt`)
```
Algorithm: kmeans
Scaling: true
Optimal K: 3
Best Silhouette Score: 0.72
Number of Samples: 20
Feature Dimensions: 150
```

#### Cluster Characteristics (`cluster_stats.tsv`)
```
Cluster  Size  Centroid_R00001  Centroid_R00002  Avg_Silhouette
0        8     0.95             1.12             0.68
1        7     1.35             0.82             0.74
2        5     0.65             1.55             0.69
```

## Examples

### Basic K-means Clustering

```bash
# Simple K-means with elbow analysis
marea_cluster -td /opt/COBRAxy \
  -in ras_data.tsv \
  -cy kmeans \
  -sc true \
  -k1 2 \
  -k2 10 \
  -el true \
  -si true \
  -idop kmeans_results/ \
  -ol kmeans.log
```

### DBSCAN Analysis

```bash
# Density-based clustering with custom parameters
marea_cluster -td /opt/COBRAxy \
  -in flux_samples.tsv \
  -cy dbscan \
  -sc true \
  -ms 5 \
  -ep 0.5 \
  -idop dbscan_results/ \
  -bc best_dbscan_clusters.tsv \
  -ol dbscan.log
```

### Hierarchical Clustering

```bash
# Hierarchical clustering for small dataset
marea_cluster -td /opt/COBRAxy \
  -in rps_scores.tsv \
  -cy hierarchy \
  -sc true \
  -k1 2 \
  -k2 6 \
  -si true \
  -idop hierarchical_results/ \
  -ol hierarchy.log
```

### Comprehensive Clustering Analysis

```bash
# Compare multiple algorithms
algorithms=("kmeans" "dbscan" "hierarchy")
for alg in "${algorithms[@]}"; do
  marea_cluster -td /opt/COBRAxy \
    -in metabolomics_data.tsv \
    -cy "$alg" \
    -sc true \
    -k1 2 \
    -k2 8 \
    -el true \
    -si true \
    -idop "${alg}_clustering/" \
    -ol "${alg}_cluster.log"
done
```

## Parameter Optimization

### K-means Optimization

#### Elbow Method
1. Run K-means for each k from `k_min` to `k_max`
2. Plot WCSS vs k
3. Identify "elbow" point where improvement diminishes
4. Select k at elbow as optimal
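
The quantities behind these steps can be sketched in a few lines. `wcss` computes the within-cluster sum of squares for one labeling, and `elbow_k` picks the k where the curve bends most, using the largest second difference. Both helper names are hypothetical, the second-difference rule is only one of several heuristics for locating the elbow (the plot is usually read by eye), and the WCSS curve below is made-up illustrative data, not tool output.

```python
def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(
            sum((a - c) ** 2 for a, c in zip(point, centroid))
            for point in members
        )
    return total

def elbow_k(wcss_by_k):
    """Return the k with the largest second difference of the WCSS curve.

    Assumes consecutive values of k; the first and last k cannot be chosen.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = wcss_by_k[prev] - 2 * wcss_by_k[k] + wcss_by_k[nxt]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(wcss(points, [0, 0, 1, 1]))

# Illustrative WCSS values only
curve = {2: 100.0, 3: 40.0, 4: 35.0, 5: 32.0}
print(elbow_k(curve))
```

In the illustrative curve, WCSS drops sharply up to k = 3 and only slightly afterwards, which is the pattern the elbow method looks for.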

#### Silhouette Analysis
1. Compute silhouette scores for each k
2. Select k with highest average silhouette score
3. Validate with silhouette plots
4. Ensure clusters are well-separated
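
To make step 1 concrete, the sketch below computes the per-sample silhouette score `s = (b - a) / max(a, b)`, where `a` is the mean distance to the other members of the sample's own cluster and `b` is the mean distance to the nearest other cluster. It is a plain-Python illustration with a hypothetical function name and Euclidean distance assumed; the tool's own implementation may differ.

```python
import math

def silhouette_scores(points, labels):
    """Per-sample silhouette scores using Euclidean distance."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        own = dists.pop(label, [])
        if not own or not dists:
            # Singleton cluster, or only one cluster overall: use 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in dists.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = silhouette_scores(points, [0, 0, 1, 1])  # compact, well-separated
bad = silhouette_scores(points, [0, 1, 0, 1])   # clusters cut across the gap
print(sum(good) / len(good), sum(bad) / len(bad))
```

Averaging the scores for each tested k and keeping the k with the highest average corresponds to step 2.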

### DBSCAN Parameter Tuning

#### Epsilon (eps) Selection
- Use k-distance plot to identify knee point
- Start with eps = average distance to k-th nearest neighbor
- Adjust based on cluster quality metrics
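
The values plotted in a k-distance plot can be computed as in the sketch below. `k_distances` is a hypothetical helper using Euclidean distance and a brute-force search, which is fine for small sample counts but slow for large ones.

```python
import math

def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        neighbors = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        result.append(neighbors[k - 1])
    return sorted(result)

# Four nearby samples and one distant sample
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
print(kd)
```

Plotting the returned values in order shows a sharp jump at the distant sample; the distance just before that jump (the knee) is a reasonable first candidate for `-ep`.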

#### Min Samples Selection
- Rule of thumb: min_samples ≥ dimensionality + 1
- Higher values create denser clusters
- Lower values may increase noise sensitivity

### Hierarchical Clustering

#### Linkage Method
- Ward: Minimizes within-cluster variance
- Complete: Maximum pairwise distance between clusters
- Average: Mean pairwise distance between clusters
- Single: Minimum pairwise distance between clusters (prone to chaining)
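
The difference between the distance-based linkage criteria can be seen on a small example. The sketch below uses a hypothetical `linkage_distance` helper with Euclidean distance; Ward linkage is omitted because it is defined through the increase in within-cluster variance rather than through pairwise distances. This page does not list a flag for choosing the linkage, so the code is purely explanatory.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under single, complete, or average linkage."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported linkage: {method}")

a = [(0, 0), (0, 1)]
b = [(3, 0), (4, 0)]
for method in ("single", "average", "complete"):
    print(method, linkage_distance(a, b, method))
```

At each agglomeration step, the two clusters with the smallest linkage distance are merged, so the choice of criterion changes which clusters join first.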

## Quality Assessment

### Internal Validation Metrics

#### Silhouette Score
- Range: [-1, 1]
- >0.7: Strong clustering
- 0.5-0.7: Reasonable clustering
- <0.5: Weak clustering

#### Calinski-Harabasz Index
- Higher values indicate better clustering
- Ratio of between-cluster to within-cluster variance

#### Davies-Bouldin Index
- Lower values indicate better clustering
- Average similarity between each cluster and its most similar cluster
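
As an illustration of the Davies-Bouldin index, the sketch below computes it from scratch: for each cluster, take the worst-case ratio of combined scatter to centroid separation, then average over clusters. The function name is hypothetical, Euclidean distance is assumed, and the code expects at least two clusters with distinct centroids.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index: lower means more compact, better separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centroids = {
        label: [sum(dim) / len(members) for dim in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter: mean distance of a cluster's members to its centroid
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    worst = []
    for i in clusters:
        worst.append(max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        ))
    return sum(worst) / len(worst)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(davies_bouldin(points, [0, 0, 1, 1]))  # compact, well-separated clusters
print(davies_bouldin(points, [0, 1, 0, 1]))  # stretched, overlapping clusters
```

The first labeling yields a much lower value than the second, matching the "lower is better" reading above.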

### External Validation

When ground truth labels are available:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- Homogeneity and Completeness scores
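
For example, the Adjusted Rand Index compares cluster assignments with known labels while ignoring how the clusters are named. The sketch below is a standard-library implementation with a hypothetical function name; it assumes at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(true_labels)
    # Count sample pairs that fall together in both, or in each, labeling
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)

conditions = ["case", "case", "control", "control"]
print(adjusted_rand_index(conditions, [1, 1, 0, 0]))  # same grouping, different names
print(adjusted_rand_index(conditions, [0, 1, 0, 1]))  # grouping unrelated to conditions
```

A value of 1 indicates identical groupings, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than expected by chance.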

## Biological Interpretation

### Cluster Characterization

#### Metabolic Pathway Analysis
- Identify enriched pathways per cluster
- Compare metabolic profiles between clusters
- Relate clusters to biological conditions

#### Sample Annotation
- Map clusters to experimental conditions
- Identify batch effects or confounders
- Validate with independent datasets

#### Feature Importance
- Determine reactions/metabolites driving clustering
- Analyze cluster centroids for biological insights
- Connect to known metabolic phenotypes

## Integration Workflow

### Upstream Data Sources

#### COBRAxy Tools
- [RAS Generator](ras-generator.md) - Cluster based on reaction activities
- [RPS Generator](rps-generator.md) - Cluster based on reaction propensities
- [Flux Simulation](flux-simulation.md) - Cluster flux distributions

#### External Data
- Gene expression matrices
- Metabolomics datasets
- Clinical metadata

### Downstream Analysis

#### Supervised Learning
Use cluster labels for:
- Classification model training
- Biomarker discovery
- Outcome prediction

#### Differential Analysis
- Compare clusters with [MAREA](marea.md)
- Identify cluster-specific metabolic signatures
- Pathway enrichment analysis

### Typical Pipeline

```bash
# 1. Generate metabolic scores
ras_generator -td /opt/COBRAxy -in expression.tsv -ra ras.tsv

# 2. Perform clustering analysis
marea_cluster -td /opt/COBRAxy -in ras.tsv -cy kmeans \
  -sc true -k1 2 -k2 8 -el true -si true \
  -idop clusters/ -bc best_clusters.tsv

# 3. Analyze cluster differences
marea -td /opt/COBRAxy -input_data ras.tsv \
  -input_class best_clusters.tsv -comparison manyvsmany \
  -test ks -choice_map ENGRO2 -idop cluster_analysis/
```

## Tips and Best Practices

### Data Preparation
- **Normalization**: Always scale features for distance-based methods
- **Dimensionality**: Consider PCA for high-dimensional data (>1000 features)
- **Missing Values**: Handle appropriately (imputation or removal)
- **Outliers**: Identify and consider removal for K-means

### Algorithm Selection
- **K-means**: Start here for most applications
- **DBSCAN**: Use when clusters have irregular shapes or noise is present
- **Hierarchical**: Choose for small datasets or when hierarchy matters

### Parameter Selection
- **Start Simple**: Begin with default parameters
- **Use Validation**: Always employ silhouette analysis
- **Cross-Validate**: Test stability across parameter ranges
- **Biological Validation**: Ensure clusters make biological sense

### Result Interpretation
- **Multiple Algorithms**: Compare results across methods
- **Stability Assessment**: Check clustering reproducibility
- **Biological Context**: Integrate with known sample characteristics
- **Statistical Testing**: Validate cluster differences formally

## Troubleshooting

### Common Issues

**Poor clustering quality**
- Check data scaling and normalization
- Assess feature selection and dimensionality
- Try different algorithms or parameters
- Evaluate data structure with PCA/t-SNE

**Algorithm doesn't converge**
- Increase iteration limits for K-means
- Adjust epsilon/min_samples for DBSCAN
- Check for numerical stability issues
- Verify input data format

**Memory or performance issues**
- Reduce dataset size or dimensionality
- Use sampling for large datasets
- Consider approximate algorithms
- Monitor system resources

### Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| "Convergence failed" | K-means iteration limit | Increase max iterations or check data |
| "No clusters found" | DBSCAN parameters too strict | Increase eps or reduce min_samples |
| "Memory allocation error" | Dataset too large | Reduce size or increase memory |
| "Invalid silhouette score" | Single cluster found | Adjust parameters or algorithm |

### Performance Optimization

**Large Datasets**
- Use mini-batch K-means for speed
- Sample data for parameter optimization
- Employ dimensionality reduction
- Consider distributed computing

**High-Dimensional Data**
- Apply feature selection
- Use PCA preprocessing
- Consider specialized algorithms
- Validate results carefully

## Advanced Usage

### Custom Distance Metrics

For specialized applications, modify distance calculations:

```python
import numpy as np

# Custom distance function for metabolic data.
# Placeholder: plain Euclidean distance stands in for a pathway-aware metric.
def metabolic_distance(x, y):
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))
```

### Ensemble Clustering

Combine multiple clustering results:

```bash
# Run multiple algorithms and combine
for method in kmeans dbscan hierarchy; do
  marea_cluster -td /opt/COBRAxy -cy "$method" -in data.tsv -idop "${method}_results/"
done

# Consensus clustering (requires custom script)
python consensus_clustering.py -i *_results/best_clusters.tsv -o consensus.tsv
```
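
The consensus script itself is left to the user. One common approach is a co-association matrix: count how often each pair of samples lands in the same cluster across runs, then group samples that co-cluster in most runs. The sketch below illustrates that idea with hypothetical function names and an arbitrary 0.5 threshold. It assumes every run labels the same samples in the same order, and it does not treat DBSCAN noise labels specially.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]

def consensus_groups(matrix, threshold=0.5):
    """Connected components of the graph linking pairs above the threshold."""
    n = len(matrix)
    group = [None] * n
    next_id = 0
    for start in range(n):
        if group[start] is not None:
            continue
        group[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if group[j] is None and matrix[i][j] > threshold:
                    group[j] = next_id
                    stack.append(j)
        next_id += 1
    return group

# Cluster labels for four samples from three runs; label names differ per run
partitions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
matrix = co_association(partitions)
consensus = consensus_groups(matrix)
print(consensus)
```

Because only co-membership is counted, the arbitrary cluster numbering used by each algorithm does not affect the result.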
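
### Interactive Analysis

Generate interactive plots for exploration:

```python
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA

# Load clustering results and the input matrix (samples as rows)
results = pd.read_csv('best_clusters.tsv', sep='\t')
data = pd.read_csv('metabolic_data.tsv', sep='\t', index_col=0)

# Project samples onto the first two principal components
pcs = PCA(n_components=2).fit_transform(data.fillna(0))
plot_df = pd.DataFrame(pcs, columns=['PC1', 'PC2'], index=data.index)

# Align cluster labels by sample name (column names as in best_clusters.tsv)
clusters = results.set_index('Sample')['Cluster'].reindex(data.index)
plot_df['Cluster'] = clusters.astype(str).values

# Interactive scatter plot
fig = px.scatter(plot_df, x='PC1', y='PC2', color='Cluster')
fig.show()
```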

## See Also

- [MAREA](marea.md) - Statistical analysis of cluster differences
- [RAS Generator](ras-generator.md) - Generate clustering input data
- [Flux Simulation](flux-simulation.md) - Alternative clustering data source
- [Clustering Tutorial](../tutorials/clustering-analysis.md)
- [Validation Methods Reference](../tutorials/cluster-validation.md)