Practical project

For the project, you will analyze omics data from TCGA (The Cancer Genome Atlas), selecting a tumor type of your choice.

The goal is to explore and visualize key patterns in the selected dataset, applying bioinformatics techniques to extract meaningful insights.

The exam project requires you to create an Rmarkdown report.

You must submit both the the html and the .Rmd files of the report.

In the report, you can choose whether to display the code or hide it.

Make sure your report is well-structured and well-presented (adjust figure sizes, write clear explanations and so on)


You can choose to focus on either somatic mutation analysis or gene expression analysis.

You will need to complete the tasks by manipulating data and creating figures.

Include at least one chunk of code for each task.

Before each chunk, state the question you are addressing and provide a brief explanation of what you do, along with the type of plot you have chosen.


Before starting, follow these steps:

  1. Choose a tumor type and the correspondent TCGA code (You can search on TCGA or see the lesson on databases)
  2. Download data
    Keep in mind that this process may take time and require significant storage space on your computer.
    A temporary folder will be created in your working directory, so once you have cleaned the data, make sure to delete it.

Notice: We have chosen STAD tumor. Whenever you see STAD, replace it with the tumor code you selected.


A) Somatic mutation analysis

In this analysis, you will explore somatic mutations in cancer patients using mutation profile data from TCGA for a selected tumor type.

This analysis aims to (1) examine the most frequent mutations and their distribution across patients, (2) assess mutational patterns, including their genomic locations and variant classifications, and (3) investigate their potential impact on gene function.

By examining these aspects, this analysis will provide insights into the mutational landscape of the chosen cancer type and its implications for disease progression.

WES-seq data were processed as described here. Be aware of the GENCODE annotation version used.

Tasks

  1. Count the total number of unique patients and the total number of unique mutations.
    Create a plot showing, for each patient, the number of SNP, DEL, INS, ONP, etc. identified.
    If you have many patients, you may choose to adjust the plot dimensions for better visibility or display distributions instead of exact counts.

  2. Create a plot showing, for each patient, the total number of identified somatic variations divided by Variant_Classification types.
    If the number of patients is large, consider resizing the plot or using distributions instead of exact values.

  3. Exclude mutations in the following categories: 3'Flank, 3'UTR, Silent, Intron, 5'Flank, 5'UTR, IGR.
    Evaluate in how many patients each mutation is present and select the 25 most common mutations.
    Create a plot showing the total number of patients in which each mutation has been identified.

  • If no mutations are shared, consider analyzing positions or genes instead of exact mutations.
  • If fewer than 25 shared mutations exist, use the available ones
  1. Focusing on the most frequent somatic mutations identified in step 3, create a plot representing the variant allele frequency (VAF) distribution for each mutation across samples.

  2. Visualize the genomic locations of the mutations identified in step 3.

  3. Select the 50 most frequent exonic mutations.
    Retrieve the sequence of the correspondent exons and evaluate the nucleotide frequencies in the exons to which the selected mutations belong.

  4. Search for the 10 most frequent mutations in CancerVar and create a plot showing their classification.
    (A Sankey diagram could be a good option).

  5. Evaluate how many samples have a mutation in a given gene (if a gene has multiple mutations in a sample, count it only once).
    Plot the top 25 most frequently mutated genes, showing how many mutations each has.

  6. Create a g3viz plot using data from cBioPortal for a gene of interest (selected from step 8) in the tumor type under analysis.


In the following chunks, you will find instructions on using TCGAbiolinks to download data. Include this step in your report, but make sure not to evaluate it every time (adjust the chunk options accordingly).

BiocManager::install("TCGAbiolinks")

options(stringsAsFactors = F)
library(TCGAbiolinks)
library(SummarizedExperiment)


# Somatic mutations ==============

query <- GDCquery(
  project = "TCGA-STAD",
  data.category = "Simple Nucleotide Variation", 
  data.type = "Masked Somatic Mutation", 
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking",
  access = "open"
)
GDCdownload(query)
maf <- GDCprepare(query)

mutv = as.data.frame(maf)
saveRDS(mut, "Exam/STAD_Muts.rds")

B) Gene expression analysis

This project focuses on transcriptomic data, specifically analyzing differential gene expression between tumor and normal samples.

The goal is to (1) identify significantly dysregulated genes, (2) perform pathway enrichment analyses, and (3) explore functional implications and potential therapeutic targets based on gene expression patterns.

By applying statistical and bioinformatics methods, this analysis aims to uncover key transcriptional alterations associated with cancer and their potential role in tumor progression.

RNA-seq data were processed as described here. Be aware of the GENCODE annotation version used.

Tasks

You can decide either to use whole-transcriptome or instead select only protein-coding genes.

  1. Perform quality control checks (e.g., heatmap, PCA).

  2. Conduct a differential expression analysis between tumor and normal samples (volcano plot, identification of up-regulated and down-regulated genes).

  3. Perform over-representation analysis using at least two different databases (e.g., KEGG, Hallmark, GO Biological Processes - see msigdbr) and create lollipop plots.

  4. For each over-representation analysis, choose the most significant pathway and create a heatmap representing the expression of differentially expressed genes (up-regulated or down-regulated) in that pathway across samples.
    To do this:

    • Identify the top-ranked pathway for upregulated genes.
    • Retrieve the genes associated with that pathway.
    • Extract the normalized expression values from the DESeq object (counts(deseqobj, normalized=TRUE)).
    • Create a new dataframe containing only the normalized expression values for upregulated genes in the selected pathway and generate a heatmap (consider applying z-score transformation using the scale() function to enhance visibility).
  5. Repeat the process at step 4 for down-regulated genes.

  6. Select the 25 most significant genes (those with the lowest adjusted p-value) and generate a STRING plot (check the databases lesson) to assess whether the proteins encoded by these genes interact.
    Create it online and insert the image in your report.


In the following chunk there are some indications to use TCGAbiolinks to download data. Include this step in the report, but remember to not evaluate it each time (change options to the chunk).

BiocManager::install("TCGAbiolinks")

options(stringsAsFactors = F)
library(TCGAbiolinks)
library(SummarizedExperiment)

# Expression ==============

query.exp.hg38 <- GDCquery(
  project = "TCGA-STAD", 
  data.category = "Transcriptome Profiling", 
  data.type = "Gene Expression Quantification", 
  workflow.type = "STAR - Counts"
)

GDCdownload(query.exp.hg38)

expdat <- GDCprepare(
  query = query.exp.hg38
)

mrna = assay(expdat)


saveRDS(mrna, "Exam/STAD_gene_Expression.rds")
saveRDS(expdat, "Exam/STAD_gene_Expression_metadata.rds")


Examples

In the following chunks some advice to start understanding the data. This part, if you want to follow it, has to be included in your report:

Somatic mutation analysis

options(stringsAsFactors = F)

# Somatic mutations =====
mut = readRDS("Exam/STAD_Muts.rds")

mut2 = unique(mut[, c("Chromosome", "Start_Position", "End_Position", "Strand"
                      , "Hugo_Symbol", "Variant_Type", "Variant_Classification"
                      , "Reference_Allele", "Tumor_Seq_Allele2", "Tumor_Sample_Barcode"
                      , "dbSNP_RS", "HGVSc", "HGVSp", "t_depth", "t_alt_count", "Existing_variation")
                  ])

head(mut2)
##   Chromosome Start_Position End_Position Strand Hugo_Symbol Variant_Type
## 1       chr1         970353       970353      +     PLEKHN1          SNP
## 2       chr1        1485791      1485791      +      ATAD3B          SNP
## 3       chr1        1489240      1489240      +      ATAD3B          SNP
## 4       chr1        1627817      1627817      +        MIB2          SNP
## 5       chr1        2521294      2521294      +       PANK4          SNP
## 6       chr1        3432008      3432008      +      PRDM16          SNP
##   Variant_Classification Reference_Allele Tumor_Seq_Allele2
## 1      Missense_Mutation                G                 T
## 2      Missense_Mutation                C                 T
## 3                  3'UTR                G                 A
## 4                 Silent                C                 T
## 5      Missense_Mutation                G                 A
## 6                 Silent                G                 A
##           Tumor_Sample_Barcode     dbSNP_RS     HGVSc       HGVSp t_depth
## 1 TCGA-CG-4460-01A-01D-1158-08         <NA>  c.260G>T  p.Arg87Met      77
## 2 TCGA-CG-4460-01A-01D-1158-08  rs759930254  c.598C>T p.Arg200Trp      48
## 3 TCGA-CG-4460-01A-01D-1158-08  rs371673051  c.*91G>A        <NA>     142
## 4 TCGA-CG-4460-01A-01D-1158-08  rs778162188 c.2013C>T   p.Asp671=      26
## 5 TCGA-CG-4460-01A-01D-1158-08 rs1397591787  c.229C>T  p.Pro77Ser     171
## 6 TCGA-CG-4460-01A-01D-1158-08 rs1439452609 c.3564G>A  p.Pro1188=      83
##   t_alt_count         Existing_variation
## 1          11              COSV100379358
## 2          24  rs759930254;COSV100426689
## 3          46                rs371673051
## 4           6  rs778162188;COSV100599110
## 5          35 rs1397591787;COSV101022594
## 6          20  rs1439452609;COSV99579583
# key uniquely identifies a mutation
mut2$key = paste0(mut2$Chromosome, "_"
                  , mut2$Start_Position, "_"
                  , mut2$End_Position, "_"
                  , mut2$Strand, "_"
                  , mut2$Hugo_Symbol, "_"
                  , mut2$Reference_Allele, "_"
                  , mut2$Tumor_Seq_Allele2)

head(mut2$key)
## [1] "chr1_970353_970353_+_PLEKHN1_G_T"  "chr1_1485791_1485791_+_ATAD3B_C_T"
## [3] "chr1_1489240_1489240_+_ATAD3B_G_A" "chr1_1627817_1627817_+_MIB2_C_T"  
## [5] "chr1_2521294_2521294_+_PANK4_G_A"  "chr1_3432008_3432008_+_PRDM16_G_A"
# Computing VAF 
mut2$Variant_Allele_Frequency = mut2$t_alt_count/mut2$t_depth
saveRDS(mut2, "Exam/STAD_Muts_reduced.rds")

Gene expression analysis

The metadata contains information about each sample. As discussed in the database lesson, the barcode can be used to identify the sample type and distinguish between Tumor and Normal samples.
Important! Not all TCGA projects include Normal samples. Before starting your analysis, check that your dataset contains both sample types.

# Expression  ==============
ex = readRDS("Exam/STAD_gene_Expression.rds")
dim(ex) # rownames contain genes, colnames contain samples
## [1] 60660   448
ex[1:10,1:5]
##                    TCGA-CG-4444-01A-01R-1157-13 TCGA-BR-8059-01A-11R-2343-13
## ENSG00000000003.15                         2190                         1845
## ENSG00000000005.6                             2                            0
## ENSG00000000419.13                         3761                         1991
## ENSG00000000457.14                          757                          850
## ENSG00000000460.17                          784                          375
## ENSG00000000938.13                          659                          368
## ENSG00000000971.16                         3539                         2646
## ENSG00000001036.14                         3112                         4250
## ENSG00000001084.13                         2876                         2842
## ENSG00000001167.14                         1904                         2811
##                    TCGA-CG-4438-01A-01R-1157-13 TCGA-VQ-A8PB-01A-11R-A39E-31
## ENSG00000000003.15                         3921                         2681
## ENSG00000000005.6                             6                            1
## ENSG00000000419.13                         6130                         3132
## ENSG00000000457.14                         1106                         1175
## ENSG00000000460.17                          927                         1226
## ENSG00000000938.13                          664                          446
## ENSG00000000971.16                         3049                         3652
## ENSG00000001036.14                         2452                         5699
## ENSG00000001084.13                         2133                         4963
## ENSG00000001167.14                         2323                         4387
##                    TCGA-BR-6710-01A-11R-1884-13
## ENSG00000000003.15                          855
## ENSG00000000005.6                             4
## ENSG00000000419.13                          921
## ENSG00000000457.14                          483
## ENSG00000000460.17                           49
## ENSG00000000938.13                          158
## ENSG00000000971.16                         1203
## ENSG00000001036.14                         1176
## ENSG00000001084.13                         3421
## ENSG00000001167.14                          466
metadata = readRDS("Exam/STAD_gene_Expression_metadata.rds")

info_samples = data.frame(barcode = metadata@colData@listData$barcode
                          , patient = metadata@colData@listData$patient
                          , sample = metadata@colData@listData$sample
                          , definition = metadata@colData@listData$definition
                          , sample_type_id = metadata@colData@listData$sample_type_id
                          , tissue_type = metadata@colData@listData$tissue_type)


head(info_samples,n = 10)
##                         barcode      patient           sample
## 1  TCGA-D7-6815-01A-11R-1884-13 TCGA-D7-6815 TCGA-D7-6815-01A
## 2  TCGA-IN-A7NT-01A-21R-A354-31 TCGA-IN-A7NT TCGA-IN-A7NT-01A
## 3  TCGA-BR-8365-01A-21R-2343-13 TCGA-BR-8365 TCGA-BR-8365-01A
## 4  TCGA-B7-A5TJ-01A-11R-A31P-31 TCGA-B7-A5TJ TCGA-B7-A5TJ-01A
## 5  TCGA-CD-8535-01A-11R-2343-13 TCGA-CD-8535 TCGA-CD-8535-01A
## 6  TCGA-VQ-A8PK-01A-12R-A414-31 TCGA-VQ-A8PK TCGA-VQ-A8PK-01A
## 7  TCGA-VQ-AA6D-01A-11R-A414-31 TCGA-VQ-AA6D TCGA-VQ-AA6D-01A
## 8  TCGA-CG-5721-01A-11R-1602-13 TCGA-CG-5721 TCGA-CG-5721-01A
## 9  TCGA-CG-5721-11A-01R-1602-13 TCGA-CG-5721 TCGA-CG-5721-11A
## 10 TCGA-FP-8099-01A-11R-2343-13 TCGA-FP-8099 TCGA-FP-8099-01A
##             definition sample_type_id tissue_type
## 1  Primary solid Tumor             01       Tumor
## 2  Primary solid Tumor             01       Tumor
## 3  Primary solid Tumor             01       Tumor
## 4  Primary solid Tumor             01       Tumor
## 5  Primary solid Tumor             01       Tumor
## 6  Primary solid Tumor             01       Tumor
## 7  Primary solid Tumor             01       Tumor
## 8  Primary solid Tumor             01       Tumor
## 9  Solid Tissue Normal             11      Normal
## 10 Primary solid Tumor             01       Tumor
table(info_samples$tissue_type)
## 
## Normal  Tumor 
##     36    412




Paper Review

At the end, you will need to read a research paper related to the type of cancer you have chosen.

  • The paper must: - Be published in a high-impact journal (Nature, Science, Cell).
  • Be recent (published within the last 3 years).
  • Be a research article (not a review).
  • Include bioinformatics analyses (gene expression, somatic mutation evaluation, CNVs).
  • Contain plots similar to those covered in the lesson