TY - GEN
T1 - Model-based clustering with gene ranking using penalized mixtures of heavy-tailed distributions
AU - Cozzini, Alberto
AU - Jasra, Ajay
AU - Montana, Giovanni
N1 - Generated from Scopus record by KAUST IRTS on 2019-11-20
PY - 2013/6/1
Y1 - 2013/6/1
N2 - Cluster analysis of biological samples using gene expression measurements is a common task which aids the discovery of heterogeneous biological sub-populations having distinct mRNA profiles. Several model-based clustering algorithms have been proposed in which the distribution of gene expression values within each sub-group is assumed to be Gaussian. In the presence of noise and extreme observations, a mixture of Gaussian densities may over-fit and overestimate the true number of clusters. Moreover, commonly used model-based clustering algorithms do not generally provide a mechanism to quantify the relative contribution of each gene to the final partitioning of the data. We propose a penalized mixture of Student's t distributions for model-based clustering and gene ranking. Together with a resampling procedure, the proposed approach provides a means for ranking genes according to their contributions to the clustering process. Experimental results show that the algorithm performs well compared with traditional Gaussian mixtures in the presence of outliers and longer-tailed distributions. The algorithm also identifies the true informative genes with high sensitivity, and achieves improved model selection. An illustrative application to breast cancer data is also presented which confirms established tumor sub-classes. © Imperial College Press.
AB - Cluster analysis of biological samples using gene expression measurements is a common task which aids the discovery of heterogeneous biological sub-populations having distinct mRNA profiles. Several model-based clustering algorithms have been proposed in which the distribution of gene expression values within each sub-group is assumed to be Gaussian. In the presence of noise and extreme observations, a mixture of Gaussian densities may over-fit and overestimate the true number of clusters. Moreover, commonly used model-based clustering algorithms do not generally provide a mechanism to quantify the relative contribution of each gene to the final partitioning of the data. We propose a penalized mixture of Student's t distributions for model-based clustering and gene ranking. Together with a resampling procedure, the proposed approach provides a means for ranking genes according to their contributions to the clustering process. Experimental results show that the algorithm performs well compared with traditional Gaussian mixtures in the presence of outliers and longer-tailed distributions. The algorithm also identifies the true informative genes with high sensitivity, and achieves improved model selection. An illustrative application to breast cancer data is also presented which confirms established tumor sub-classes. © Imperial College Press.
UR - https://www.worldscientific.com/doi/abs/10.1142/S0219720013410072
UR - http://www.scopus.com/inward/record.url?scp=84879179728&partnerID=8YFLogxK
U2 - 10.1142/S0219720013410072
DO - 10.1142/S0219720013410072
M3 - Conference contribution
BT - Journal of Bioinformatics and Computational Biology
ER -