Abstract
Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors, X_t, depends on the multidimensional variate Y, while the remaining predictors constitute a “noise set” X_u that is independent of Y. Using Monte Carlo simulations, we investigate the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine “empirical bounds” on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods on a recent infant intestinal gene expression and metagenomics dataset.
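
The setup described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which statistic is thresholded or how the SVD is applied, so this sketch assumes both methods score predictors from the sample cross-correlation matrix between X and Y. The function names, the rank-one signal, the noise level, and the problem sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n=200, p_true=5, p_noise=45, q=3, seed=0):
    """Draw X = [X_t, X_u] and Y, where only X_t depends on Y (illustrative model)."""
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, q))
    # Assumption: each true predictor is +/- the sum of the Y columns plus noise.
    signs = rng.choice([-1.0, 1.0], size=p_true)
    X_t = np.outer(Y.sum(axis=1), signs) + 0.5 * rng.standard_normal((n, p_true))
    # Noise set: independent of Y by construction.
    X_u = rng.standard_normal((n, p_noise))
    return np.hstack([X_t, X_u]), Y


def cross_correlation(X, Y):
    """Sample cross-correlation matrix between columns of X and columns of Y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]


def select_by_threshold(X, Y, k):
    """Keep the k predictors with the largest absolute correlation to any Y column."""
    score = np.abs(cross_correlation(X, Y)).max(axis=1)
    return np.sort(np.argsort(score)[-k:])


def select_by_svd(X, Y, k, rank=1):
    """Keep the k predictors with the largest weight on the leading singular directions."""
    U, s, _ = np.linalg.svd(cross_correlation(X, Y), full_matrices=False)
    score = np.sqrt(((U[:, :rank] * s[:rank]) ** 2).sum(axis=1))
    return np.sort(np.argsort(score)[-k:])


if __name__ == "__main__":
    X, Y = simulate()
    print("thresholding:", select_by_threshold(X, Y, k=5))
    print("SVD:         ", select_by_svd(X, Y, k=5))
```

In this sketch the first `p_true` columns of `X` play the role of X_t and the remaining columns play the role of X_u, so the indices returned by each selector can be compared directly against the known true set. Repeating the simulation over many seeds and sample sizes would give a Monte Carlo estimate of each method's selection accuracy.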
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the 2012 SIAM International Conference on Data Mining |
Publisher | Society for Industrial & Applied Mathematics (SIAM) |
ISBN (Print) | 9781611972320 |
State | Published - Dec 18 2013 |
Externally published | Yes |