GenomicFeatures and GenomicRanges packages

Read the CpG island data as previously shown (refer to the Input/Output or Conditional/Loops exercises). Assign it to the variable cpgi, using stringsAsFactors = FALSE.

Subset cpgi, keeping only CpG islands located on canonical chromosomes (chr1–chr22, chrX, and chrY). Reassign the result to cpgi.
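
As an illustration of this kind of filtering, here is a minimal sketch on a toy data frame. The column names (chrom, chromStart, chromEnd, name) and the values are assumptions made for the example; check the actual column names of cpgi with head() or colnames() before adapting it.

```r
# Toy stand-in for the CpG island table (column names and values are hypothetical).
cpgi_toy <- data.frame(
  chrom      = c("chr1", "chr1_random", "chrX", "chrM", "chr22"),
  chromStart = c(100, 200, 300, 400, 500),
  chromEnd   = c(150, 260, 380, 450, 590),
  name       = paste0("cpg", 1:5),
  stringsAsFactors = FALSE
)

# Build the vector of canonical chromosome names: chr1..chr22, chrX, chrY.
canonical <- paste0("chr", c(1:22, "X", "Y"))

# Keep only the rows whose chromosome is in the canonical set.
cpgi_toy <- cpgi_toy[cpgi_toy$chrom %in% canonical, ]
print(cpgi_toy)
```

Building the list of names with paste0() avoids typing 24 chromosome names by hand, and %in% does an exact match, so names such as chr1_random are excluded.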

Convert cpgi into a GRanges object with makeGRangesFromDataFrame(). Use the options keep.extra.columns = TRUE and ignore.strand = TRUE.
Hint: If necessary, adjust the seqnames.field, start.field, end.field, and strand.field options.
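
The sketch below shows how the field options can be used when the column names are not detected automatically. The data frame and its column names (island_chr, island_start, island_end) are hypothetical and chosen only to make the field options necessary.

```r
library(GenomicRanges)

# Toy table with non-standard column names (hypothetical).
df_toy <- data.frame(
  island_chr   = c("chr1", "chr2", "chrX"),
  island_start = c(100, 500, 900),
  island_end   = c(150, 590, 990),
  name         = c("cpgA", "cpgB", "cpgC"),
  stringsAsFactors = FALSE
)

# Tell makeGRangesFromDataFrame() which columns hold the coordinates.
# keep.extra.columns = TRUE keeps "name" as a metadata column;
# ignore.strand = TRUE sets the strand of every range to "*".
gr_toy <- makeGRangesFromDataFrame(
  df_toy,
  keep.extra.columns = TRUE,
  ignore.strand      = TRUE,
  seqnames.field     = "island_chr",
  start.field        = "island_start",
  end.field          = "island_end"
)
print(gr_toy)
```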

Assign the values from the name column as the names of the object using names(), then remove the name column.
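
A minimal sketch of this step on a toy GRanges object (the ranges, names, and the score column are invented for the example):

```r
library(GenomicRanges)

# Toy GRanges with two metadata columns (hypothetical values).
gr_toy <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500), end = c(150, 590)),
  name     = c("cpgA", "cpgB"),
  score    = c(10, 20)
)

# Use the metadata column as the names of the ranges...
names(gr_toy) <- gr_toy$name

# ...then drop the now redundant metadata column.
gr_toy$name <- NULL
print(gr_toy)
```

Once the ranges are named, a single range can be selected by name, for example gr_toy["cpgB"].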

Install the TxDb package related to the human UCSC hg18 assembly, then load it and assign it to the variable genome. Take some time to explore this object.

Extract promoters from the genome object and assign them to the variable prom. Define each promoter as the region from 1000 bp upstream to 100 bp downstream (into the gene body) of the TSS.
How many ranges do you obtain?
(If you receive a warning message, you can ignore it: it refers to a non-canonical chromosome.)
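
To see what promoters() does with the upstream and downstream arguments, here is a small sketch on a toy GRanges object rather than on the TxDb object. The two "genes" are invented; the same arguments apply when promoters() is called on a TxDb, where one range per transcript is returned.

```r
library(GenomicRanges)

# Two hypothetical genes, one on each strand.
genes_toy <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(5000, 7000), end = c(6000, 8000)),
  strand   = c("+", "-")
)

# The TSS is the start of a "+" range and the end of a "-" range.
prom_toy <- promoters(genes_toy, upstream = 1000, downstream = 100)
print(prom_toy)
# "+" gene: TSS 5000 -> promoter 4000-5099
# "-" gene: TSS 8000 -> promoter 7901-9000

# Number of ranges obtained.
length(prom_toy)
```

Note that the result is strand-aware: "upstream" means lower coordinates on the plus strand and higher coordinates on the minus strand.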

Keep only the ranges that belong to canonical chromosomes (chr1–chr22, chrX, and chrY), and reassign the result to prom.
After subsetting, remember to update seqlevels.
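
The reason for updating seqlevels is that subsetting removes ranges but not the list of known sequence names. A minimal sketch on a toy object (sequence names and coordinates are invented):

```r
library(GenomicRanges)

# Toy ranges, one of them on a non-canonical sequence (hypothetical).
gr_toy <- GRanges(
  seqnames = c("chr1", "chr6_cox_hap1", "chrY"),
  ranges   = IRanges(start = c(1, 10, 20), width = 5)
)

canonical <- paste0("chr", c(1:22, "X", "Y"))

# Subset the ranges: the non-canonical range is removed...
gr_toy <- gr_toy[as.character(seqnames(gr_toy)) %in% canonical]

# ...but its sequence name is still registered in the object.
seqlevels(gr_toy)

# Drop the sequence levels that are no longer used by any range.
seqlevels(gr_toy) <- seqlevelsInUse(gr_toy)
seqlevels(gr_toy)
```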

Retrieve the CpG islands that overlap with promoters and assign them to a new variable cpg_prom.

How many CpG islands are present in the initial cpgi object?
How many CpG islands overlap the promoters in prom?

Hint: When retrieving the query hits from the overlap, use the unique() function to avoid redundant ranges.
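
The following sketch, on invented toy ranges, shows why unique() is needed: a query range that overlaps several subject ranges appears once per overlap in the hits.

```r
library(GenomicRanges)

# Toy CpG islands (query) and promoters (subject); coordinates are hypothetical.
cpg_toy  <- GRanges("chr1", IRanges(start = c(100, 1000, 5000),
                                    end   = c(300, 1200, 5200)))
prom_toy <- GRanges("chr1", IRanges(start = c(50, 250, 1100),
                                    end   = c(150, 400, 1300)))

hits <- findOverlaps(cpg_toy, prom_toy)

# The first island overlaps two promoters, so its index appears twice.
queryHits(hits)
# 1 1 2

# Keep each overlapping island only once.
cpg_prom_toy <- cpg_toy[unique(queryHits(hits))]
length(cpg_toy)       # islands before filtering
length(cpg_prom_toy)  # islands overlapping at least one promoter
```

subsetByOverlaps(cpg_toy, prom_toy) returns the same ranges in one call; going through findOverlaps() is useful when the hit pairs themselves are needed.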

Read the BED file containing histone methylation (H3K4me3) peaks in untreated HeLa cells (H3K4me3_unstim_hg18_xset200_dupsN_ht5.sub.peaks_manipulated.bed in the Datasets folder) and assign it to the variable meth.
The data were retrieved from [BCGSC](https://www.bcgsc.ca/data/histone-modification/histone-modification-data) and slightly modified.
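
One possible way to read a BED-like file into a GRanges object is sketched below on a small temporary file. This assumes a headerless, tab-separated file with standard 0-based BED start coordinates; the actual file was modified, so inspect its first lines before choosing the read.table() options. The file contents and column names here are invented.

```r
library(GenomicRanges)

# Write a tiny BED-like file to a temporary location (hypothetical content).
bed_path <- tempfile(fileext = ".bed")
writeLines(c("chr1\t99\t200\tpeak1",
             "chr2\t499\t800\tpeak2"), bed_path)

# Read it as a plain table; column names are supplied by hand.
bed_toy <- read.table(bed_path, sep = "\t", header = FALSE,
                      stringsAsFactors = FALSE,
                      col.names = c("chrom", "start", "end", "name"))

# Standard BED starts are 0-based; starts.in.df.are.0based = TRUE
# converts them to the 1-based coordinates used by GRanges.
meth_toy <- makeGRangesFromDataFrame(bed_toy,
                                     keep.extra.columns = TRUE,
                                     starts.in.df.are.0based = TRUE)
print(meth_toy)
```

If the file follows the BED specification, rtracklayer::import() is an alternative that handles the coordinate conversion itself.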

Find overlaps between cpg_prom and meth, then:

Retrieve the subset of cpg_prom that overlaps with meth and assign it to the variable cpg_prom_Ov.
Hint: Use unique() to retrieve only unique positions.
Retrieve the subset of meth that overlaps with cpg_prom and assign it to the variable meth_Ov.
Hint: Use unique() to retrieve only unique positions.
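
Both subsets come from the same Hits object: queryHits() indexes the first argument of findOverlaps() and subjectHits() the second. A minimal sketch on invented toy ranges (the variable names with the _toy suffix are hypothetical):

```r
library(GenomicRanges)

# Toy promoter-associated CpG islands and methylation peaks (hypothetical).
cpg_prom_toy <- GRanges("chr1", IRanges(start = c(100, 1000, 5000),
                                        end   = c(300, 1200, 5200)))
meth_toy     <- GRanges("chr1", IRanges(start = c(250, 280, 5100, 9000),
                                        end   = c(1100, 290, 5150, 9100)))

ov <- findOverlaps(cpg_prom_toy, meth_toy)
ov  # four hit pairs: indices repeat on both the query and the subject side

# Islands overlapping at least one peak (indices into the query).
cpg_prom_Ov_toy <- cpg_prom_toy[unique(queryHits(ov))]

# Peaks overlapping at least one island (indices into the subject).
meth_Ov_toy <- meth_toy[unique(subjectHits(ov))]

length(cpg_prom_Ov_toy)
length(meth_Ov_toy)
```

Here the first peak spans two islands and the first island contains two peaks, so without unique() both subsets would contain repeated ranges.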

Import the GTF file for mouse version M24, located in the Datasets folder, with the makeTxDbFromGFF() function. (Note: This operation may take some time.)
Assign the result to the variable mouse.
Explore this object using the columns() function: it contains a lot of useful information.

Extract transcripts from mouse and assign them to the variable transc. Use the parameter columns = c("tx_name", "gene_id") to include additional information.

Create a vector all_transcripts containing all unique transcript names from transc.
Then, evaluate the length of the vector.
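
The last step can be sketched on a toy GRanges object that mimics the shape of a transcripts() result, with tx_name and gene_id as metadata columns. The ranges and identifiers are invented, and in a real transcripts() result gene_id is typically a list-like column rather than a plain character vector.

```r
library(GenomicRanges)

# Toy stand-in for the output of transcripts() (hypothetical identifiers).
transc_toy <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(100, 100, 900), end = c(500, 700, 1500)),
  strand   = "+",
  tx_name  = c("tx1.1", "tx1.2", "tx1.2"),
  gene_id  = c("g1", "g1", "g1")
)

# Metadata columns are accessed with $; unique() removes repeated names.
all_transcripts_toy <- unique(transc_toy$tx_name)
length(all_transcripts_toy)
```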