Skip to content
Prediction of biological age in colon cancer patients using epigenetic data
  • AI Chat
  • Code
  • Report
  • Spinner

    Introduction

    The goal of this project is to perform an exploratory data analsis on DNA methylation data in order to acquire biological insights related to CpGs that are relevant for various aspects of colon cancer e.g. tumour staging, biological age, etc. A CpG is a C, followed by a G but from the end to the end of a DNA strand.

    Epigenetic mechanisms regulate gene expression. Certain technologies enable us measure some epigentic endpoints. One example is DNA methylation. DNA methylation is a chemical process that occurs around genes and which is capable of silencing gene expression. In this project, I will analyse DNA methylation data for the whole genome and relate it to phenotypic variation.

    It has been studied that when part of a genome is methylated, the gene close to it is not expressed. It is inherited at mitosis.

    Methylation often occurs at CpG islands.

    When one DNA strand is methylated, so is the other. if we also have a CpG on one strand, we have it as well on the other strand.When DNA replicates, its methylation characteristics are preserved.

    In this project I will be performing exploratory data analysis on public DNA methylation data, from genomicsclass.

    ## install necessary packages
    # devtools::install_github("genomicsclass/coloncancermeth")
     if (!require("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    
    BiocManager::install("IRanges")
    ## load packages
    library(ggplot2)
    # library(coloncancermeth)
    library(IRanges)
    # library(S4Vectors)
    # load the data
    coloncancermeth <- load("coloncancermeth.rda")
    coloncancermeth
    dim(meth)
    str(meth)

    Overview of the metadata

    colnames(pd)

    Perform PCA

    The goal of performing PCA is to identify the CPGs that are strongly predictive of the age. The approach consists of identifying the principal components whose loading scores have a significant relationship with the outcome of interest which is the age. The top 24 CPGs with the highest loading scores in the PC with significant relationship with the age will be selected to serve as predictors in a linear model with age as outcome. The model will be used predict the age. This prediction will constitute the biological age.

    pca <- prcomp(t(meth), scale = TRUE)
    pca.data <- data.frame(Sample=pd$Status,
      X=pca$x[,1],
      Y=pca$x[,2])
    pca.data
     
    ggplot(data=pca.data, aes(x=X, y=Y, 
    						  label=Sample, color=Sample)) +
      geom_text() +
      xlab(paste("PC1 - ", pca.var.per[1], "%", sep="")) +
      ylab(paste("PC2 - ", pca.var.per[2], "%", sep="")) +
      theme_bw() +
      ggtitle("PCA of colon tissues")+
       theme(plot.title = element_text(hjust = 0.5))
     
    ## make a scree plot
    pca.var <- pca$sdev^2
    pca.var.per <- round(pca.var/sum(pca.var)*100, 1)
    pca.var.per
     
    barplot(pca.var.per, main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")

    According to the bar plot above, PC1 accounts for nearly 27% of the variation in the dataset. The next step is to identify which principal components have a significant correlation with the age.

    pcs <- as.data.frame(pca$x)
    
    age <- pd$patient.age_at_initial_pathologic_diagnosis
    
    p_values <- sapply(pcs, function(x){
    	 cort_test_p <- cor.test(x, age)$p.value
    	 cort_test_p
    })
    p_values[p_values <= 0.1]

    Results of pca show that PC2 has a significant correlation with the age, meanwhile, PC10, and PC20 are the pcs with (near-) significant correlation with the age. The next step will be to select the cpgs top 24 cpgs that have strongest loading scores on the 2nd PC. I select 24 since there are only 27 observations in the data and selecting more than 24 features will overwhelm the model.

    ## get the name of the top 11000 measurements (cpgs) that contribute most to pc1.
    loading_scores <- pca$rotation[,2]
    cpgs_scores <- abs(loading_scores) ## get the magnitudes
    cpgs_score_ranked <- sort(cpgs_scores, decreasing=TRUE)
    top_cpgs_pc2<- names(cpgs_score_ranked[1:24])