VARIABLE SELECTION IN HIGH-DIMENSIONAL APPLICATIONS

This project is motivated by a biomedical application, carried out as part of a collaboration between Carlos III University and Gregorio Marañón Hospital in Madrid. Over the last decade, the cost of sequencing the RNA of cancer cells has fallen significantly. The new challenge is how to analyze all that information in order to understand the disease and help patients.
Our research deals with a modern statistical challenge, closely related to computational advances, that has been in the spotlight for the last couple of decades. Specifically, we are interested in situations where the number of variables is of the same order as, or even larger than, the number of observations. In those scenarios, classical statistical methods fail to deliver correct results, either because they overfit or because they require solving a numerically ill-posed problem.
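The ill-posedness mentioned above can be seen in a toy example. The sketch below is only an illustration with made-up numbers (the matrix X, the response y, and the two coefficient vectors are assumptions, not project data): with 2 observations and 3 variables, two different coefficient vectors fit the data perfectly, so least squares alone cannot choose between them.

```python
# Toy design with n = 2 observations and p = 3 variables (illustrative numbers).
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]


def predict(X, beta):
    """Return the fitted values X @ beta for a list-of-rows matrix."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


def rss(X, y, beta):
    """Residual sum of squares of a linear fit."""
    return sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, predict(X, beta)))


# Two different coefficient vectors that both reproduce y exactly.
beta_a = [1.0, 2.0, 0.0]
beta_b = [0.0, 1.0, 1.0]

rss_a = rss(X, y, beta_a)
rss_b = rss(X, y, beta_b)
print(rss_a, rss_b)  # both residual sums of squares are zero
```

Because both fits are perfect, some additional criterion, such as a sparsity-inducing penalty, is needed to single out one solution.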
As part of this project, we have developed several R packages.
sglfast is an R package that fits Sparse-Group Lasso regression with individual regularization parameters for each group. It also implements the iterative sparse-group lasso (isgl), an algorithm for selecting the optimal regularization parameters of the Sparse-Group Lasso.
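To make the idea of individual group regularization parameters concrete, here is a minimal sketch of a sparse-group lasso penalty in which every group carries its own weight. It assumes the penalty has the form lambda_l1 * ||beta||_1 + sum over groups of weight_g * ||beta_g||_2; the exact parametrization used by sglfast may differ, and the function and argument names are hypothetical, not part of the package.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lambda_l1, group_weights):
    """Assumed penalty: lambda_l1 * ||beta||_1 + sum_g group_weights[g] * ||beta_g||_2.

    beta          -- list of coefficients
    groups        -- list of index lists, one per group (hypothetical layout)
    lambda_l1     -- weight of the lasso (L1) term
    group_weights -- one regularization parameter per group
    """
    l1_term = sum(abs(b) for b in beta)
    group_term = 0.0
    for indices, weight in zip(groups, group_weights):
        # Euclidean norm of the coefficients belonging to this group
        group_norm = math.sqrt(sum(beta[i] ** 2 for i in indices))
        group_term += weight * group_norm
    return lambda_l1 * l1_term + group_term


# Illustrative values only: three groups with different weights.
beta = [3.0, -4.0, 0.0, 0.0, 1.0]
groups = [[0, 1], [2, 3], [4]]
penalty = sparse_group_lasso_penalty(beta, groups, lambda_l1=0.5,
                                     group_weights=[1.0, 2.0, 0.1])
print(penalty)
```

A group whose coefficients are all zero, like the second one here, contributes nothing to the group term, which is how a penalty of this shape encourages whole groups to drop out of the model.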
glasp is an R package that implements the Group Linear Algorithm with Sparse Principal decomposition, a variable selection and clustering method for generalized linear models. It extends the Sparse-Group Lasso by computing the groups automatically. The internal supervised variable clustering algorithm is also an original contribution and integrates naturally with the Group Lasso penalty. Moreover, the implementation allows the risk function to be changed, so it can address a wide range of regression problems.

Collaborator: Hospital Gregorio Marañón