view COBRAxy/docs/tools/ras-generator.md @ 509:5956dcf94277 draft default tip

Uploaded
author francesco_lapi
date Wed, 01 Oct 2025 15:34:21 +0000
parents 4ed95023af20
children
line wrap: on
line source

# RAS Generator

Generate Reaction Activity Scores (RAS) from gene expression data and GPR (Gene-Protein-Reaction) rules.

## Overview

The RAS Generator computes metabolic reaction activity by:
1. Mapping gene expression to reactions via GPR rules
2. Applying logical operations (AND/OR) for enzyme complexes
3. Producing activity scores for each reaction in each sample

**Input**: Gene expression data + GPR rules  
**Output**: Reaction activity scores (RAS)

## Parameters

### Required Parameters

| Parameter | Short | Type | Description |
|-----------|--------|------|-------------|
| `--tool_dir` | `-td` | string | COBRAxy installation directory |
| `--input` | `-in` | file | Gene expression dataset (TSV format) |
| `--ras_output` | `-ra` | file | Output file for RAS values |
| `--rules_selector` | `-rs` | choice | Built-in model (ENGRO2, Recon, HMRcore) |

### Optional Parameters

| Parameter | Short | Type | Default | Description |
|-----------|--------|------|---------|-------------|
| `--none` | `-n` | boolean | true | Handle missing gene values |
| `--model_upload` | `-rl` | file | - | Custom GPR rules file |
| `--model_upload_name` | `-rn` | string | - | Custom model name |
| `--out_log` | - | file | log.txt | Output log file |

## Input Format

### Gene Expression File
```tsv
Gene_ID	Sample_1	Sample_2	Sample_3	Sample_4
HGNC:5	10.5	11.2	15.7	14.3
HGNC:10	3.2	4.1	8.8	7.9
HGNC:15	7.9	8.2	4.4	5.1
HGNC:25	12.1	13.5	18.2	17.8
```

**Requirements**:
- First column: Gene identifiers (HGNC, Ensembl, Entrez, etc.)
- Subsequent columns: Expression values (numeric)
- Header row with sample names
- Tab-separated format

### Custom GPR Rules File (Optional)
```tsv
Reaction_ID	GPR
R_HEX1	HGNC:4922
R_PGI	HGNC:8906
R_PFK	HGNC:8877 or HGNC:8878
R_ALDOA	HGNC:414 and HGNC:417
```

## Algorithm Details

### GPR Rule Processing

**Gene Mapping**: Each gene in the expression data is mapped to reactions via GPR rules.

**Logical Operations**:
- **OR**: `Gene1 or Gene2` → `max(expr1, expr2)` or `expr1 + expr2`
- **AND**: `Gene1 and Gene2` → `min(expr1, expr2)`

**Missing Gene Handling**:
- `-n true`: Missing genes treated as 0, OR operations continue
- `-n false`: Missing genes cause reaction score to be null

### RAS Computation

For each reaction and sample:

1. **Parse GPR rule** into nested logical structure
2. **Replace gene names** with expression values  
3. **Evaluate logical operations** recursively
4. **Assign RAS score** based on final result

**Example**:
```
GPR: (HGNC:5 and HGNC:10) or HGNC:15
Expression: HGNC:5=10.5, HGNC:10=3.2, HGNC:15=7.9
RAS = max(min(10.5, 3.2), 7.9) = max(3.2, 7.9) = 7.9
```

## Output Format

### RAS Values File
```tsv
Reactions	Sample_1	Sample_2	Sample_3	Sample_4
R_HEX1	8.5	9.2	12.1	11.3
R_PGI	7.3	8.1	6.4	7.2
R_PFK	15.2	16.8	20.1	18.9
R_ALDOA	3.2	4.1	4.4	5.1
```

**Format**:
- First column: Reaction identifiers
- Subsequent columns: RAS values for each sample
- Missing values represented as "None"

## Usage Examples

### Command Line

```bash
# Basic usage with built-in model
ras_generator -td /path/to/COBRAxy \
  -in expression_data.tsv \
  -ra ras_output.tsv \
  -rs ENGRO2

# With custom model and strict missing gene handling
ras_generator -td /path/to/COBRAxy \
  -in expression_data.tsv \
  -ra ras_output.tsv \
  -rl custom_rules.tsv \
  -rn "CustomModel" \
  -n false
```

### Python API

```python
import ras_generator

# Basic RAS generation
args = [
    '-td', '/path/to/COBRAxy',
    '-in', 'expression_data.tsv', 
    '-ra', 'ras_output.tsv',
    '-rs', 'ENGRO2'
]

ras_generator.main(args)
```

### Galaxy Usage

1. Upload gene expression file to Galaxy
2. Select **RAS Generator** from COBRAxy tools
3. Configure parameters:
   - **Input dataset**: Your expression file
   - **Rule selector**: ENGRO2 (or other model)
   - **Handle missing genes**: Yes/No
4. Click **Execute**

## Built-in Models

### ENGRO2 (Recommended for most analyses)
- **Scope**: Focused human metabolism
- **Reactions**: ~2,000
- **Genes**: ~500
- **Use case**: General metabolic analysis

### Recon (Comprehensive analysis)
- **Scope**: Complete human metabolism  
- **Reactions**: ~10,000
- **Genes**: ~2,000
- **Use case**: Detailed metabolic studies

### HMRcore (Balanced option)
- **Scope**: Core human metabolism
- **Reactions**: ~5,000  
- **Genes**: ~1,000
- **Use case**: Balanced coverage

## Gene ID Mapping

COBRAxy supports multiple gene identifier formats:

| Format | Example | Notes |
|--------|---------|--------|
| **HGNC ID** | HGNC:5 | Recommended, most stable |
| **HGNC Symbol** | ALDOA | Human-readable but may change |
| **Ensembl** | ENSG00000149925 | Version-specific |
| **Entrez** | 226 | Numeric identifier |

**Recommendation**: Use HGNC IDs for best compatibility and stability.



## Troubleshooting

### Common Issues

**"Gene not found" warnings**
```
Solution: Check gene ID format matches model expectations
- Verify gene identifiers (HGNC vs symbols vs Ensembl)
- Use gene mapping tools if needed
- Set -n true to handle missing genes gracefully
```

**"No computable scores" error**
```
Solution: Insufficient gene overlap between data and model
- Check gene ID format compatibility
- Verify expression file format
- Try different built-in model
```

**Empty output file**
```
Solution: Check input file format and permissions
- Ensure TSV format with proper headers
- Verify file paths are correct
- Check write permissions for output directory
```



### Debug Mode

Enable detailed logging:

```bash
ras_generator -td /path/to/COBRAxy \
  -in expression_data.tsv \
  -ra ras_output.tsv \
  -rs ENGRO2 \
  --out_log detailed_log.txt
```

Check log file for detailed error messages and processing statistics.

## Validation

### Check Output Quality

```python
import pandas as pd

# Read RAS output
ras_df = pd.read_csv('ras_output.tsv', sep='\t', index_col=0)

# Basic statistics
print(f"RAS matrix shape: {ras_df.shape}")
print(f"Non-null values: {ras_df.count().sum()}")
print(f"Value range: {ras_df.min().min():.2f} to {ras_df.max().max():.2f}")

# Check for problematic reactions
null_reactions = ras_df.isnull().all(axis=1).sum()
print(f"Reactions with no data: {null_reactions}")
```

### Expected Results

- **Coverage**: 60-90% of reactions should have computable scores
- **Range**: RAS values typically 0-20 for log-transformed expression
- **Distribution**: Should reflect biological variation in your samples

## Integration with Other Tools

### Downstream Analysis

RAS output can be used with:

- **[MAREA](marea.md)**: Statistical enrichment analysis
- **[RAS to Bounds](ras-to-bounds.md)**: Flux constraint application
- **[MAREA Cluster](marea-cluster.md)**: Sample clustering

### Preprocessing Options

Before RAS generation:
- **Normalize** expression data (log2, quantile, etc.)
- **Filter** low-expression genes
- **Batch correct** if multiple datasets

## Advanced Usage

### Custom Model Integration

```python
# Create custom GPR rules
custom_rules = {
    'R_CUSTOM1': 'HGNC:5 and HGNC:10',
    'R_CUSTOM2': 'HGNC:15 or HGNC:20'  
}

# Save as TSV
import pandas as pd
rules_df = pd.DataFrame(list(custom_rules.items()), 
                       columns=['Reaction_ID', 'GPR'])
rules_df.to_csv('custom_rules.tsv', sep='\t', index=False)

# Use with RAS generator
args = ['-rl', 'custom_rules.tsv', '-rn', 'CustomModel']
```

### Batch Processing

```python
# Process multiple expression files
expression_files = ['data1.tsv', 'data2.tsv', 'data3.tsv']

for i, exp_file in enumerate(expression_files):
    output_file = f'ras_output_{i}.tsv'
    
    args = [
        '-td', '/path/to/COBRAxy',
        '-in', exp_file,
        '-ra', output_file,
        '-rs', 'ENGRO2'
    ]
    
    ras_generator.main(args)
    print(f"Processed {exp_file} → {output_file}")
```

## References

- [COBRApy documentation](https://cobrapy.readthedocs.io/) - Underlying metabolic modeling
- [GPR rules format](https://cobrapy.readthedocs.io/en/stable/getting_started.html#gene-protein-reaction-rules) - Standard format specification  
- [HGNC database](https://www.genenames.org/) - Gene nomenclature standards