Identifying Regulatory Patterns at the 3'end Regions of Over-expressed and Under-expressed Genes

  • Ghofran K. Othoum

Student thesis: Master's Thesis


Promoters, together with neighboring regulatory regions and those extending further upstream of the 5’end of genes, are considered one of the main components affecting the expression status of genes in a specific phenotype. More recently, research by Chen et al. (2006, 2012) and Mapendano et al. (2010) demonstrated that the 3’end regulatory regions of genes also influence gene expression. However, the association between the regulatory regions surrounding the 3’end of genes and their over- or under-expression status in a particular phenotype has not been systematically studied. The aim of this study is to ascertain whether regulatory regions surrounding the 3’end of genes contain sufficient regulatory information to correlate genes with their expression status in a particular phenotype. Over- and under-expressed ovarian cancer (OC) genes were used as a model. Exploratory analysis of the 3’end regions was performed by transforming the annotated regions using principal component analysis (PCA) and then clustering the transformed data, which achieved a clear separation of genes with different expression status. Additionally, several classification algorithms, including Naïve Bayes, Random Forest and Support Vector Machine (SVM), were tested with different parameter settings to analyze the capacity of the 3’end regions to discriminate genes by expression status. The best performance was achieved using the SVM classification model, which under 10-fold cross-validation yielded an accuracy of 98.4%, sensitivity of 99.5% and specificity of 92.5%. To predict gene expression status for newly available instances, based on information derived from the 3’end regions, an SVM predictive model was developed that, with 10-fold cross-validation, yielded an accuracy of 67.0%, sensitivity of 73.2% and specificity of 61.0%.
Moreover, building an SVM model with a polynomial kernel on PCA-transformed data yielded an accuracy of 83.1%, sensitivity of 92.5% and specificity of 74.8%, evaluated using 10-fold cross-validation. These clustering and classification analyses strongly suggest that the regions surrounding the 3’end of genes contain sufficiently rich regulatory information to discriminate between over- and under-expressed genes, at least in the case of genes implicated in OC.
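
The accuracy, sensitivity and specificity figures quoted above are all derived from a binary confusion matrix. The following is a minimal sketch of that arithmetic, not the thesis's own code: the function name `confusion_metrics`, the labels `"over"` and `"under"`, and the choice of over-expressed genes as the positive class are assumptions made for illustration, and the toy labels are invented rather than taken from the study.

```python
from typing import Dict, Sequence


def confusion_metrics(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    positive: str = "over",  # assumption: over-expressed genes are the positive class
) -> Dict[str, float]:
    """Compute accuracy, sensitivity and specificity for a binary classifier."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fn = tn = fp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            if guess == positive:
                tp += 1  # positive gene correctly predicted positive
            else:
                fn += 1  # positive gene missed
        else:
            if guess == positive:
                fp += 1  # negative gene wrongly predicted positive
            else:
                tn += 1  # negative gene correctly predicted negative

    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        # fraction of all genes classified correctly
        "accuracy": (tp + tn) / total if total else nan,
        # fraction of positive-class genes recovered (true positive rate)
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # fraction of negative-class genes recovered (true negative rate)
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Invented toy labels: five over-expressed and three under-expressed genes.
y_true = ["over", "over", "over", "over", "over", "under", "under", "under"]
y_pred = ["over", "over", "over", "over", "under", "under", "under", "over"]
metrics = confusion_metrics(y_true, y_pred)
print(metrics)
```

In a 10-fold cross-validation setting, one common convention is to pool the held-out predictions from all ten folds and compute these three values once on the pooled labels; whether the thesis pooled predictions or averaged per-fold scores is not stated in the abstract.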
Date of Award: May 2013
Original language: English (US)
Awarding Institution
  • Biological, Environmental Sciences and Engineering
Supervisor: Vladimir Bajic


Keywords
  • 3'end regions
  • regulatory regions
  • data mining
  • clustering analysis
  • classification model
  • ovarian cancer
