Mining Structural and Functional Patterns in Pathogenic and Benign Genetic Variants through Non-negative Matrix Factorization

  • Karla A Peña-Guerra

Student thesis: Master's Thesis


The main challenge in studying genetics has evolved from identifying variations and their impact on traits to comprehending the molecular mechanisms through which genetic variations affect human biology, including disease susceptibility. Although large-scale genome-wide association studies (GWAS) have identified a vast number of variants associated with human traits, a significant portion of these variants still lack a detailed mechanistic explanation [1]. Addressing this uncertainty requires precise and scalable approaches for discovering how genetic variation influences phenotypes at the molecular level. In this study, we developed a pipeline to automate the annotation of the effects of variants on structural features. We applied this pipeline to a dataset of 33,942 variants from the ClinVar and gnomAD databases, which included both pathogenic and benign associations. To bridge the gap between genetic variation data and molecular phenotypes, we applied Non-negative Matrix Factorization (NMF) to this large-scale dataset. The algorithm revealed 6 distinct clusters of variants with similar feature profiles. Two of these clusters were predominantly benign (70% and 85% benign variants), while one showed an almost equal distribution of pathogenic and benign variants. The remaining three clusters were predominantly pathogenic (68%, 83%, and 77% pathogenic variants, respectively). These findings provide insights into the underlying mechanisms contributing to pathogenicity. Further analysis of this dataset and the exploration of disease-related genes can enhance the accuracy of genetic diagnosis and therapeutic development through the direct inference of variants that are likely to affect the functioning of essential genes.
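The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how NMF was implemented or how variants were assigned to clusters, so everything below is an assumption made for illustration: the toy feature matrix, the labels, the choice of 2 components instead of 6, the Lee-Seung multiplicative updates, and the rule of assigning each variant to its highest-weight component. The code is dependency-free Python for readability; a real analysis would more likely use an optimized library.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (variants x features) into W (variants x rank)
    and H (rank x features) with multiplicative updates for the squared error."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    # Rescale so each row of H sums to 1, which makes the columns of W comparable.
    for k in range(rank):
        scale = sum(h[k]) or 1.0
        h[k] = [x / scale for x in h[k]]
        for row in w:
            row[k] *= scale
    return w, h

def assign_clusters(w):
    """Assign each variant to the component with the largest weight (an assumed rule)."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]

def pathogenic_fraction(clusters, labels):
    """Return {cluster: fraction of pathogenic variants}; labels are 1 (pathogenic) or 0 (benign)."""
    totals, hits = {}, {}
    for c, y in zip(clusters, labels):
        totals[c] = totals.get(c, 0) + 1
        hits[c] = hits.get(c, 0) + y
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical toy data: 6 variants x 4 non-negative structural-feature scores.
features = [
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.9, 0.8, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.8, 1.0],
    [0.0, 0.0, 0.9, 0.9],
]
labels = [1, 1, 0, 0, 0, 1]  # hypothetical pathogenic (1) / benign (0) labels

w, h = nmf(features, rank=2)
clusters = assign_clusters(w)
fractions = pathogenic_fraction(clusters, labels)
print(clusters, fractions)
```

In this sketch the rows of H play the role of shared feature profiles and the rows of W give each variant's weight on those profiles, so the per-cluster pathogenic fraction is the toy counterpart of the cluster percentages reported in the abstract.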
Date of Award: Aug 2023
Original language: English (US)
Awarding Institution
  • Biological, Environmental Sciences and Engineering
Supervisor: Stefan Arold


Keywords
  • genetics
  • genetic variants
  • pathogenic
  • variant annotation
  • structural features
  • Non-negative Matrix Factorization
  • clusters
