Pipeline for Efficient Mapping of Transcription Factor Binding Sites and Comparison of Their Models

  • Wail Ba Alawi

Student thesis: Master's Thesis

Abstract

The control of genes in every living organism is based on activities of transcription factor (TF) proteins. These TFs interact with DNA by binding to the TF binding sites (TFBSs) and in that way create conditions for the genes to activate. Of the approximately 1500 TFs in human, TFBSs are experimentally derived only for less than 300 TFs and only in generally limited portions of the genome. To be able to associate TF to genes they control we need to know if TFs will have a potential to interact with the control region of the gene. For this we need to have models of TFBS families. The existing models are not sufficiently accurate or they are too complex for use by ordinary biologists. To remove some of the deficiencies of these models, in this study we developed a pipeline through which we achieved the following: 1. Through a comparison analysis of the performance we identified the best models with optimized thresholds among the four different types of models of TFBS families. 2. Using the best models we mapped TFBSs to the human genome in an efficient way. The study shows that a new scoring function used with TFBS models based on the position weight matrix of dinucleotides with remote dependency results in better accuracy than the other three types of the TFBS models. The speed of mapping has been improved by developing a parallelized code and shows a significant speed up of 4x when going from 1 CPU to 8 CPUs. To verify if the predicted TFBSs are more accurate than what can be expected with the conventional models, we identified the most frequent pairs of TFBSs (for TFs E4F1 and ATF6) that appeared close to each other (within the distance of 200 nucleotides) over the human genome. We show unexpectedly that the genes that are most close to the multiple pairs of E4F1/ATF6 binding sites have a co-expression of over 90%. This indirectly supports our hypothesis that the TFBS models we use are more accurate and also suggests that the E4F1/ATF6 pair is exerting the control over these genes.
Date of AwardJun 2011
Original languageEnglish (US)
Awarding Institution
  • Computer, Electrical and Mathematical Sciences and Engineering
SupervisorVladimir Bajic (Supervisor)

Cite this

'