For the project, you will analyze omics data from TCGA (The Cancer Genome Atlas), selecting a tumor type of your choice.
The goal is to explore and visualize key patterns in the selected dataset, applying bioinformatics techniques to extract meaningful insights.
The exam project requires you to create an Rmarkdown
report.
You must submit both the the html
and the
.Rmd
files of the report.
In the report, you can choose whether to display the code or hide it.
Make sure your report is well-structured and well-presented (adjust figure sizes, write clear explanations and so on)
You can choose to focus on either somatic mutation analysis or gene expression analysis.
You will need to complete the tasks by manipulating data and creating figures.
Include at least one chunk of code for each task.
Before each chunk, state the question you are addressing and provide a brief explanation of what you do, along with the type of plot you have chosen.
Before starting, follow these steps:
Notice: We have chosen STAD tumor. Whenever you see
STAD
, replace it with the tumor code you selected.
In this analysis, you will explore somatic mutations in cancer patients using mutation profile data from TCGA for a selected tumor type.
This analysis aims to (1) examine the most frequent mutations and their distribution across patients, (2) assess mutational patterns, including their genomic locations and variant classifications, and (3) investigate their potential impact on gene function.
By examining these aspects, this analysis will provide insights into the mutational landscape of the chosen cancer type and its implications for disease progression.
WES-seq data were processed as described here. Be aware of the GENCODE annotation version used.
Count the total number of unique patients and the total number of
unique mutations.
Create a plot showing, for each patient, the number of SNP
,
DEL
, INS
, ONP
, etc.
identified.
If you have many patients, you may choose to adjust the plot dimensions
for better visibility or display distributions instead of exact
counts.
Create a plot showing, for each patient, the total number of
identified somatic variations divided by
Variant_Classification
types.
If the number of patients is large, consider resizing the plot or using
distributions instead of exact values.
Exclude mutations in the following categories:
3'Flank
, 3'UTR
, Silent
,
Intron
, 5'Flank
, 5'UTR
,
IGR
.
Evaluate in how many patients each mutation is present and select the 25
most common mutations.
Create a plot showing the total number of patients in which each
mutation has been identified.
Focusing on the most frequent somatic mutations identified in
step 3
, create a plot representing the variant
allele frequency (VAF) distribution for each mutation across
samples.
Visualize the genomic locations of the mutations identified in step 3.
Select the 50 most frequent exonic
mutations.
Retrieve the sequence of the correspondent exons and evaluate the
nucleotide frequencies in the exons to which the selected mutations
belong.
Search for the 10 most frequent mutations in
CancerVar and create a plot showing their
classification.
(A Sankey diagram could be a good option).
Evaluate how many samples have a mutation in a given gene (if a
gene has multiple mutations in a sample, count it only once).
Plot the top 25 most frequently mutated genes, showing how many
mutations each has.
Create a g3viz plot using data from cBioPortal for a gene of interest (selected from step 8) in the tumor type under analysis.
In the following chunks, you will find instructions on using TCGAbiolinks to download data. Include this step in your report, but make sure not to evaluate it every time (adjust the chunk options accordingly).
BiocManager::install("TCGAbiolinks")
options(stringsAsFactors = F)
library(TCGAbiolinks)
library(SummarizedExperiment)
# Somatic mutations ==============
query <- GDCquery(
project = "TCGA-STAD",
data.category = "Simple Nucleotide Variation",
data.type = "Masked Somatic Mutation",
workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking",
access = "open"
)
GDCdownload(query)
maf <- GDCprepare(query)
mutv = as.data.frame(maf)
saveRDS(mut, "Exam/STAD_Muts.rds")
This project focuses on transcriptomic data, specifically analyzing differential gene expression between tumor and normal samples.
The goal is to (1) identify significantly dysregulated genes, (2) perform pathway enrichment analyses, and (3) explore functional implications and potential therapeutic targets based on gene expression patterns.
By applying statistical and bioinformatics methods, this analysis aims to uncover key transcriptional alterations associated with cancer and their potential role in tumor progression.
RNA-seq data were processed as described here. Be aware of the GENCODE annotation version used.
You can decide either to use whole-transcriptome or instead select only protein-coding genes.
Perform quality control checks (e.g., heatmap, PCA).
Conduct a differential expression analysis between tumor and normal samples (volcano plot, identification of up-regulated and down-regulated genes).
Perform over-representation analysis using at
least two different databases (e.g., KEGG
,
Hallmark
, GO Biological Processes
- see
msigdbr
) and create lollipop plots.
For each over-representation analysis, choose the most
significant pathway and create a heatmap representing
the expression of differentially expressed genes (up-regulated or
down-regulated) in that pathway across samples.
To do this:
counts(deseqobj, normalized=TRUE)
).scale()
function to
enhance visibility).Repeat the process at step 4 for down-regulated genes.
Select the 25 most significant genes (those with the
lowest adjusted p-value) and generate a
STRING plot
(check the databases lesson) to assess whether
the proteins encoded by these genes interact.
Create it online and insert the image in your report.
In the following chunk there are some indications to use
TCGAbiolinks
to download data. Include this step in the
report, but remember to not evaluate it each time (change options to the
chunk).
BiocManager::install("TCGAbiolinks")
options(stringsAsFactors = F)
library(TCGAbiolinks)
library(SummarizedExperiment)
# Expression ==============
query.exp.hg38 <- GDCquery(
project = "TCGA-STAD",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "STAR - Counts"
)
GDCdownload(query.exp.hg38)
expdat <- GDCprepare(
query = query.exp.hg38
)
mrna = assay(expdat)
saveRDS(mrna, "Exam/STAD_gene_Expression.rds")
saveRDS(expdat, "Exam/STAD_gene_Expression_metadata.rds")
In the following chunks some advice to start understanding the data. This part, if you want to follow it, has to be included in your report:
options(stringsAsFactors = F)
# Somatic mutations =====
mut = readRDS("Exam/STAD_Muts.rds")
mut2 = unique(mut[, c("Chromosome", "Start_Position", "End_Position", "Strand"
, "Hugo_Symbol", "Variant_Type", "Variant_Classification"
, "Reference_Allele", "Tumor_Seq_Allele2", "Tumor_Sample_Barcode"
, "dbSNP_RS", "HGVSc", "HGVSp", "t_depth", "t_alt_count", "Existing_variation")
])
head(mut2)
## Chromosome Start_Position End_Position Strand Hugo_Symbol Variant_Type
## 1 chr1 970353 970353 + PLEKHN1 SNP
## 2 chr1 1485791 1485791 + ATAD3B SNP
## 3 chr1 1489240 1489240 + ATAD3B SNP
## 4 chr1 1627817 1627817 + MIB2 SNP
## 5 chr1 2521294 2521294 + PANK4 SNP
## 6 chr1 3432008 3432008 + PRDM16 SNP
## Variant_Classification Reference_Allele Tumor_Seq_Allele2
## 1 Missense_Mutation G T
## 2 Missense_Mutation C T
## 3 3'UTR G A
## 4 Silent C T
## 5 Missense_Mutation G A
## 6 Silent G A
## Tumor_Sample_Barcode dbSNP_RS HGVSc HGVSp t_depth
## 1 TCGA-CG-4460-01A-01D-1158-08 <NA> c.260G>T p.Arg87Met 77
## 2 TCGA-CG-4460-01A-01D-1158-08 rs759930254 c.598C>T p.Arg200Trp 48
## 3 TCGA-CG-4460-01A-01D-1158-08 rs371673051 c.*91G>A <NA> 142
## 4 TCGA-CG-4460-01A-01D-1158-08 rs778162188 c.2013C>T p.Asp671= 26
## 5 TCGA-CG-4460-01A-01D-1158-08 rs1397591787 c.229C>T p.Pro77Ser 171
## 6 TCGA-CG-4460-01A-01D-1158-08 rs1439452609 c.3564G>A p.Pro1188= 83
## t_alt_count Existing_variation
## 1 11 COSV100379358
## 2 24 rs759930254;COSV100426689
## 3 46 rs371673051
## 4 6 rs778162188;COSV100599110
## 5 35 rs1397591787;COSV101022594
## 6 20 rs1439452609;COSV99579583
# key uniquely identifies a mutation
mut2$key = paste0(mut2$Chromosome, "_"
, mut2$Start_Position, "_"
, mut2$End_Position, "_"
, mut2$Strand, "_"
, mut2$Hugo_Symbol, "_"
, mut2$Reference_Allele, "_"
, mut2$Tumor_Seq_Allele2)
head(mut2$key)
## [1] "chr1_970353_970353_+_PLEKHN1_G_T" "chr1_1485791_1485791_+_ATAD3B_C_T"
## [3] "chr1_1489240_1489240_+_ATAD3B_G_A" "chr1_1627817_1627817_+_MIB2_C_T"
## [5] "chr1_2521294_2521294_+_PANK4_G_A" "chr1_3432008_3432008_+_PRDM16_G_A"
# Computing VAF
mut2$Variant_Allele_Frequency = mut2$t_alt_count/mut2$t_depth
saveRDS(mut2, "Exam/STAD_Muts_reduced.rds")
The metadata contains information about each sample. As discussed in
the database lesson, the barcode can be used to
identify the sample type and distinguish between Tumor and
Normal samples.
Important! Not all TCGA projects include Normal
samples. Before starting your analysis, check that your dataset contains
both sample types.
# Expression ==============
ex = readRDS("Exam/STAD_gene_Expression.rds")
dim(ex) # rownames contain genes, colnames contain samples
## [1] 60660 448
ex[1:10,1:5]
## TCGA-CG-4444-01A-01R-1157-13 TCGA-BR-8059-01A-11R-2343-13
## ENSG00000000003.15 2190 1845
## ENSG00000000005.6 2 0
## ENSG00000000419.13 3761 1991
## ENSG00000000457.14 757 850
## ENSG00000000460.17 784 375
## ENSG00000000938.13 659 368
## ENSG00000000971.16 3539 2646
## ENSG00000001036.14 3112 4250
## ENSG00000001084.13 2876 2842
## ENSG00000001167.14 1904 2811
## TCGA-CG-4438-01A-01R-1157-13 TCGA-VQ-A8PB-01A-11R-A39E-31
## ENSG00000000003.15 3921 2681
## ENSG00000000005.6 6 1
## ENSG00000000419.13 6130 3132
## ENSG00000000457.14 1106 1175
## ENSG00000000460.17 927 1226
## ENSG00000000938.13 664 446
## ENSG00000000971.16 3049 3652
## ENSG00000001036.14 2452 5699
## ENSG00000001084.13 2133 4963
## ENSG00000001167.14 2323 4387
## TCGA-BR-6710-01A-11R-1884-13
## ENSG00000000003.15 855
## ENSG00000000005.6 4
## ENSG00000000419.13 921
## ENSG00000000457.14 483
## ENSG00000000460.17 49
## ENSG00000000938.13 158
## ENSG00000000971.16 1203
## ENSG00000001036.14 1176
## ENSG00000001084.13 3421
## ENSG00000001167.14 466
metadata = readRDS("Exam/STAD_gene_Expression_metadata.rds")
info_samples = data.frame(barcode = metadata@colData@listData$barcode
, patient = metadata@colData@listData$patient
, sample = metadata@colData@listData$sample
, definition = metadata@colData@listData$definition
, sample_type_id = metadata@colData@listData$sample_type_id
, tissue_type = metadata@colData@listData$tissue_type)
head(info_samples,n = 10)
## barcode patient sample
## 1 TCGA-D7-6815-01A-11R-1884-13 TCGA-D7-6815 TCGA-D7-6815-01A
## 2 TCGA-IN-A7NT-01A-21R-A354-31 TCGA-IN-A7NT TCGA-IN-A7NT-01A
## 3 TCGA-BR-8365-01A-21R-2343-13 TCGA-BR-8365 TCGA-BR-8365-01A
## 4 TCGA-B7-A5TJ-01A-11R-A31P-31 TCGA-B7-A5TJ TCGA-B7-A5TJ-01A
## 5 TCGA-CD-8535-01A-11R-2343-13 TCGA-CD-8535 TCGA-CD-8535-01A
## 6 TCGA-VQ-A8PK-01A-12R-A414-31 TCGA-VQ-A8PK TCGA-VQ-A8PK-01A
## 7 TCGA-VQ-AA6D-01A-11R-A414-31 TCGA-VQ-AA6D TCGA-VQ-AA6D-01A
## 8 TCGA-CG-5721-01A-11R-1602-13 TCGA-CG-5721 TCGA-CG-5721-01A
## 9 TCGA-CG-5721-11A-01R-1602-13 TCGA-CG-5721 TCGA-CG-5721-11A
## 10 TCGA-FP-8099-01A-11R-2343-13 TCGA-FP-8099 TCGA-FP-8099-01A
## definition sample_type_id tissue_type
## 1 Primary solid Tumor 01 Tumor
## 2 Primary solid Tumor 01 Tumor
## 3 Primary solid Tumor 01 Tumor
## 4 Primary solid Tumor 01 Tumor
## 5 Primary solid Tumor 01 Tumor
## 6 Primary solid Tumor 01 Tumor
## 7 Primary solid Tumor 01 Tumor
## 8 Primary solid Tumor 01 Tumor
## 9 Solid Tissue Normal 11 Normal
## 10 Primary solid Tumor 01 Tumor
table(info_samples$tissue_type)
##
## Normal Tumor
## 36 412
At the end, you will need to read a research paper related to the type of cancer you have chosen.