Introduction

This document is an addendum to the review “Artificial intelligence in bulk and single-cell RNA-sequencing data to foster precision oncology” (International Journal of Molecular Sciences, 2021). In this tutorial we will provide a toy example of how RNA-seq data can be analyzed through Machine Learning techniques to perform sample classification and gene signature extraction. In particular, we will use published RNA-seq data of samples belonging to two cancer types (i.e. prostate and lung) and normal samples belonging to the same tissue types to train and test Random Forest and Support Vector Machine (SVM) classifiers. The aim of this tutorial is to present an analysis workflow that includes some of the ML concepts discussed in the manuscript. The presented analysis is based on the R “caret” package and it was performed on R version 4.0.4.

Load and filter data

We will first load tumor and normal RNA-seq data of prostate and lung tissues. Tumor samples belong to The Cancer Genome Atlas (TCGA) project, while normal samples belong to the Genotype-Tissue Expression (GTEx) project. As discussed in the review, the first source of heterogeneity that must be taken into account developing a ML analysis on transcriptomic data is the one originated by technical factors. In our case, the major source of technical heterogeneity would be given by the fact that samples belong to two distinct projects that used different analysis pipelines. The use of batch-correction techniques would be needed to allow a joint analysis of samples of the two datasets. However, the data that we will use for our analysis have already been subjected to batch-correction to make the datasets comparable. These data were produced by Wang et al. (Scientific Data, 2018, https://doi.org/10.1038/sdata.2018.61) and downloaded from https://github.com/mskcc/RNAseqDB . Gene expression values from these datasets, normalized in Fragments Per Kilobase Million (FPKM), will be used as features by the ML models.

Note that the first two columns of each dataset contain gene-related annotations (Hugo and ENTREZ format).

##   Hugo_Symbol Entrez_Gene_Id TCGA.VN.A88P.01A.11R.A352.07
## 1        CUL5           8065                       218.79
## 2        REV1          51455                       420.68
## 3      ZNF428         126299                      1135.20
## 4      CX3CR1           1524                         5.36
## 5        FAXC          84553                       100.13
##   TCGA.CH.5739.01A.11R.1580.07
## 1                       329.84
## 2                       500.46
## 3                       599.49
## 4                        98.04
## 5                       226.54

Load gene set of interest

To reduce the number of possible features, we restrict our analysis to a known set of cancer-associated genes. We will use cancer genes annotated in NCG6 (Network of Cancer Genes 6) which is a comprehensive catalogue of experimentally-validated cancer genes by Repana et al. Genome Biology, 2019 (https://doi.org/10.1186/s13059-018-1612-0).

Tumor and Normal datasets are filtered on the set of cancer genes.

Separate training and test set

We randomly split each dataset in a training and a test set. The first one will be used by the model in the learning process to tune its parameters and achieve an optimal classification based on known labels (i.e. Tumor-Normal or PRAD-LUAD). The second one, containing samples that the model has not seen before, will be used to assess the performance of the trained model. The training set will contain 75% of samples, while the test set will contain the remaining 25%.

Tumor-Normal Classification

The aim of this section is to train ML models capable of discriminating tumor samples (prostate and lung adenocarcinoma) from samples of normal prostate and lung tissues. We will train a Random Forest and an SVM classifier, compare them on training samples and evaluate their performance on the test set. Finally, for each model, we will extract a putative signature of cancer genes whose expression most likely discriminates between tumor and normal samples.

Data preprocessing

As discussed in the manuscript, feature centering and scaling is a critical preprocessing step to assure that all variables proportionally contribute to the AI model. We then use the preProcess function of the “caret” package to jointly center and scale the training and test sets.

Explore dataset

Before moving to the learning step, we can visualise the entire dataset composed of training and test set using a dimensionality reduction method. We here propose the use of the t-distributed stochastic neighbor embedding (tSNE), but other techniques, such as PCA and UMAP, could be used as well. Given tSNE inherent stochasticity, a seed needs to be set to ensure the reproducibility of the visualisation.

Tumor-Normal Models

Random Forest: Train model

We will now define and train a Random Forest for tumor-normal classification. Firstly, we specify the formula containing the relation between features (genes’ expression) and sample type (tumor or normal).

Secondly, we define the resampling method to be used during the training of the model. In this case we use 10-fold cross-validation (cv).

The training set is finally fed to the Random Forest model (rf) and the learning step is performed. Attention! The following module could take several minutes to train model. Feel free to skip this module if you want and directly load the pre-trained model from the subsequent module.

The pre-trained model can be loaded from the Rdata folder.

The resulting trained model is:

## Random Forest 
## 
## 1012 samples
##  705 predictor
##    2 classes: 'Normal', 'Tumor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 911, 911, 911, 911, 910, 910, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##     2   0.9985152  0.9173387  0.9957143
##     3   0.9985188  0.9141129  0.9914079
##     7   0.9985886  0.9363911  0.9914079
##    14   0.9986058  0.9394153  0.9899793
##    27   0.9984030  0.9395161  0.9885300
##    52   0.9982279  0.9523185  0.9871014
##    99   0.9980934  0.9490927  0.9885300
##   191   0.9973625  0.9585685  0.9856522
##   367   0.9968737  0.9524194  0.9842236
##   704   0.9957053  0.9333669  0.9870600
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.

We can test the performance of the model on the training data comparing the predicted sample types with the true ones. Both the metrics used to evaluate the performance show a perfect classification capability.

## Accuracy    Kappa 
##        1        1

SVM: Train model

Following the same steps described for the Random Forest we now train a SVM for the same classification task.

The pre-trained model can be loaded from the Rdata folder.

Show the trained model and evaluate its performance on the training set.

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1012 samples
##  705 predictor
##    2 classes: 'Normal', 'Tumor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 911, 911, 911, 911, 910, 910, ... 
## Resampling results across tuning parameters:
## 
##   C       ROC        Sens       Spec     
##     0.25  0.9975413  0.9712702  0.9770807
##     0.50  0.9988627  0.9745968  0.9856936
##     1.00  0.9995011  0.9809476  0.9871222
##     2.00  0.9995939  0.9841734  0.9900000
##     4.00  0.9997307  0.9937500  0.9914286
##     8.00  0.9997753  0.9937500  0.9928571
##    16.00  0.9997753  0.9937500  0.9928571
##    32.00  0.9997753  0.9937500  0.9928571
##    64.00  0.9997753  0.9937500  0.9928571
##   128.00  0.9997753  0.9937500  0.9928571
## 
## Tuning parameter 'sigma' was held constant at a value of 0.001607629
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.001607629 and C = 8.

The SVM achieves a perfect classification of samples in the training set.

## Accuracy    Kappa 
##        1        1

Compare models

Both models perform extremely well on the training set. However, to test whether one model performs significantly better than the other, a comparison based on three metrics (Sensitivity, Specificity and ROC) can be applied.

## 
## Call:
## summary.resamples(object = res)
## 
## Models: SVM, RF 
## Number of resamples: 10 
## 
## ROC 
##          Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## SVM 0.9982143 1.0000000 1.0000000 0.9997753 1.0000000    1    0
## RF  0.9955357 0.9982719 0.9993088 0.9986058 0.9995471    1    0
## 
## Sens 
##          Min. 1st Qu.    Median      Mean  3rd Qu.    Max. NA's
## SVM 0.9687500  1.0000 1.0000000 0.9937500 1.000000 1.00000    0
## RF  0.8064516  0.9375 0.9677419 0.9394153 0.968498 0.96875    0
## 
## Spec 
##          Min.   1st Qu.    Median      Mean 3rd Qu. Max. NA's
## SVM 0.9714286 0.9857143 1.0000000 0.9928571       1    1    0
## RF  0.9571429 0.9857143 0.9928571 0.9899793       1    1    0

## 
## Call:
## diff.resamples(x = res)
## 
## Models: SVM, RF 
## Metrics: ROC, Sens, Spec 
## Number of differences: 1 
## p-value adjustment: bonferroni
## 
## Call:
## summary.diff.resamples(object = difValues)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## ROC 
##     SVM     RF     
## SVM         0.00117
## RF  0.01611        
## 
## Sens 
##     SVM     RF     
## SVM         0.05433
## RF  0.01228        
## 
## Spec 
##     SVM    RF      
## SVM        0.002878
## RF  0.3416

Confirming what can be observed in the plots above, this result indicates a significantly higher Sensitivity in the SVM model with respect to Random Forest. Finally, the difference of the three metrics between the two models can be visualised.

The trained models are stored in new variables.

Random Forest: Test model

We can now test the classification performance of the trained Random Forest on the test set, which contains samples not used for training. Again, we compare predicted sample types with the true ones. The confusionMatrix function returns a more complete summary of the comparison, explicitly indicating the number of correctly classified or misclassified samples.

##  Accuracy     Kappa 
## 0.9821429 0.9577748
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Normal Tumor
##     Normal     99     1
##     Tumor       5   231
##                                           
##                Accuracy : 0.9821          
##                  95% CI : (0.9615, 0.9934)
##     No Information Rate : 0.6905          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9578          
##                                           
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.9519          
##             Specificity : 0.9957          
##          Pos Pred Value : 0.9900          
##          Neg Pred Value : 0.9788          
##              Prevalence : 0.3095          
##          Detection Rate : 0.2946          
##    Detection Prevalence : 0.2976          
##       Balanced Accuracy : 0.9738          
##                                           
##        'Positive' Class : Normal          
## 

The above results indicate that only six samples of the test set were misclassified by the Random Forest. The following plots show the tSNE representation of the entire dataset, with samples belonging to the training set shown in grey and samples of the test set colored either based on the true labels (left) or on the predicted ones (right). The black cross indicates the position of the misclassified sample.

SVM: Test model

As done for the Random Forest we now test the classification performance of the SVM on the test set.

## Accuracy    Kappa 
##        1        1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Normal Tumor
##     Normal    104     0
##     Tumor       0   232
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9891, 1)
##     No Information Rate : 0.6905     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.3095     
##          Detection Rate : 0.3095     
##    Detection Prevalence : 0.3095     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Normal     
## 

This model achieves a perfect classification of samples in the test set, as shown by the above results and plots below.

Compare important features

Finally, once shown the high classification capability of both models on unseed data, we extract the features (genes) that have the highest importance for classification. For each model we select the top 20 genes based on importance metrics. These gene sets constitute putative signatures of genes involved in discriminating tumor and normal samples.

SVM most important features

Ten genes are identified among the top 20 important features by both models. In a real research setting, these consensus genes would likely be prioritized for further investigation.

##  [1] "BMPR1A" "SIX2"   "CCND2"  "FOXO1"  "EED"    "MYCN"   "HLF"    "GAS7"  
##  [9] "STAT5B" "MALT1"

PRAD-LUAD Classification

In this section we will use Random Forest and SVM models to classify samples of two tumor types: prostate and lung adenocarcinoma (PRAD and LUAD, respectively). We will follow the same steps described in the previous section. It is worth noting that, given the strong molecular differences between prostate and lung tissue, this classification task is expected to be easy for the models.

Create full test set matrix

We create the full test set matrix, including PRAD and LUAD samples, that will be used for the evaluation of the model performance on unseen data.

Explore dataset

Before going into the learning step, we can visualise the entire dataset, composed of training and test set, using the dimensionality reduction method tSNE. Given tSNE inherent stochasticity, a seed needs to be set to ensure the reproducibility of the visualisation.

Tumor-Tumor Models

Random Forest: Train model

We define and train a Random Forest for PRAD-LUAD classification. Firstly, we specify the formula containing the relation between features (genes’ expression) and sample type (PRAD or LUAD).

Secondly, we define the resampling method to be used during the training of the model. In this case we use 10-fold cross-validation (cv).

The training set is finally fed to the Random Forest model (rf) and the learning step is performed. Attention! The following module could take several minutes to train model. Feel free to skip this module if you want and directly load the pre-trained model from the subsequent module.

The pre-trained model can be loaded from the Rdata folder.

The resulting trained model is:

## Random Forest 
## 
## 697 samples
## 705 predictors
##   2 classes: 'PRAD', 'LUAD' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 628, 628, 627, 627, 627, 627, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC  Sens  Spec
##     2   1    1     1   
##     3   1    1     1   
##     7   1    1     1   
##    14   1    1     1   
##    27   1    1     1   
##    52   1    1     1   
##    99   1    1     1   
##   191   1    1     1   
##   367   1    1     1   
##   704   1    1     1   
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

We can test the performance of the model on the training data comparing the predicted sample types with the true ones. Both the metrics used to evaluate the performance indicate a perfect classification on the training set.

## Accuracy    Kappa 
##        1        1

SVM: Train model

Following the same steps described for the Random Forest we train a SVM for the same classification task.

Show the trained model and evaluate its performance on the training set.

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 697 samples
## 705 predictors
##   2 classes: 'PRAD', 'LUAD' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 628, 628, 627, 627, 627, 627, ... 
## Resampling results across tuning parameters:
## 
##   C       ROC  Sens  Spec
##     0.25  1    1     1   
##     0.50  1    1     1   
##     1.00  1    1     1   
##     2.00  1    1     1   
##     4.00  1    1     1   
##     8.00  1    1     1   
##    16.00  1    1     1   
##    32.00  1    1     1   
##    64.00  1    1     1   
##   128.00  1    1     1   
## 
## Tuning parameter 'sigma' was held constant at a value of 0.001650364
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.001650364 and C = 0.25.
## Accuracy    Kappa 
##        1        1

Also the SVM achieves a perfect classification of the training set.

Compare models

As shown above, both models achieve a perfect classification of the training set. The comparison below confirms this point, showing no differences of the evaluation metrics between the two models.

## 
## Call:
## summary.resamples(object = res)
## 
## Models: SVM, RF 
## Number of resamples: 10 
## 
## ROC 
##     Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM    1       1      1    1       1    1    0
## RF     1       1      1    1       1    1    0
## 
## Sens 
##     Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM    1       1      1    1       1    1    0
## RF     1       1      1    1       1    1    0
## 
## Spec 
##     Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVM    1       1      1    1       1    1    0
## RF     1       1      1    1       1    1    0

## 
## Call:
## diff.resamples(x = res)
## 
## Models: SVM, RF 
## Metrics: ROC, Sens, Spec 
## Number of differences: 1 
## p-value adjustment: bonferroni
## 
## Call:
## summary.diff.resamples(object = difValues)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## ROC 
##     SVM RF
## SVM      0
## RF  NA    
## 
## Sens 
##     SVM RF
## SVM      0
## RF  NA    
## 
## Spec 
##     SVM RF
## SVM      0
## RF  NA

Random Forest: Test model

We can now test the classification performance of the trained Random Forest on the test set, which contains samples not used for training. Again, we compare predicted sample types with the true ones. The confusionMatrix function returns a more complete summary of the comparison, explicitly indicating the number of correctly classified or misclassified samples.

## Accuracy    Kappa 
##        1        1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction PRAD LUAD
##       PRAD  106    0
##       LUAD    0  126
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9842, 1)
##     No Information Rate : 0.5431     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4569     
##          Detection Rate : 0.4569     
##    Detection Prevalence : 0.4569     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : PRAD       
## 

The Random Forest achieves a perfect classification also of the test set, with no misclassified samples.

SVM: Test model

As done for the Random Forest we now test the classification performance of the SVM on the test set.

## Accuracy    Kappa 
##        1        1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction PRAD LUAD
##       PRAD  106    0
##       LUAD    0  126
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9842, 1)
##     No Information Rate : 0.5431     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4569     
##          Detection Rate : 0.4569     
##    Detection Prevalence : 0.4569     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : PRAD       
## 

Also the SVM achieves a perfect classification of the test set.

Compare important features

Finally, we can extract the features (genes) that have the highest importance for classification. These gene sets constitute putative signatures of genes involved in discriminating the two tumor types.

SVM most important features

Based on the ROC metric, that measures feature importance for the SVM model, 108 genes have the same importance=1. The plot below shows a random subset of 20 of these genes.

We finally intersect the top 20 important genes of the Random Forest, with the 108 genes with importance=1 in the SVM. 13 genes are identified in both models, suggesting their role in discriminating PRAD and LUAD samples.

##  [1] "SSX4"   "H3F3A"  "TCF12"  "B2M"    "PRDM1"  "TRIM27" "KDM5A"  "FAT1"  
##  [9] "RGPD3"  "MAPK1"  "IKZF1"  "NUTM2A" "DGCR8"