[STAT/AP] Stefan van Aelst: Split modeling ensembles

05 February 2024, 15:45 to 16:45 - Location: EEMCS Hall G

"There are two popular approaches to analyzing high-dimensional regression problems. On the one hand, sparse methods can be used, which yield models with a limited number of predictors and are therefore highly interpretable. Well-known methods are best subset selection, the lasso and the elastic net. On the other hand, ensemble methods can be used, which often outperform sparse methods in terms of prediction accuracy. Ensemble methods are usually “black-box” methods that combine information from a large number of models but lack interpretability. Well-known examples are random forests and xgboost.

We aim to combine the strengths of ensembling with the interpretability of sparse methods. To this end, we introduce the split modeling framework to learn a small number of sparse models simultaneously from the data. We discuss the accuracy-diversity trade-off for ensembles, which leads to an objective function that learns a few sparse and diverse models at once. Each of these models provides an explanation for the relationship between a subset of predictors and the outcome of interest. To minimize this objective function, we extend recent developments in optimization for sparse methods to our multi-model framework. With a good choice of ensembling function, the resulting ensemble model retains the interpretability of each of the individual models. We show that, at the same time, these ensembles achieve excellent prediction accuracy and have good properties. We illustrate the good performance of our split modeling ensembles on gene expression data, considering both regression and classification problems."
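The point about the ensembling function can be illustrated with a minimal sketch. It is not the speaker's method: the abstract describes learning the split by minimizing an objective function, whereas this sketch assumes a random partition of the predictors, plain least squares per block, centered data without an intercept, and simple averaging as the ensembling function. The names `fit_split_ensemble` and `ensemble_coefficients` are hypothetical. What it shows is only that an average of linear models is again a linear model, so the ensemble stays as readable as its members.

```python
import numpy as np


def fit_split_ensemble(X, y, n_models=3, seed=0):
    """Fit one least-squares model per disjoint block of predictors.

    Returns an (n_models, p) array. Row g holds the coefficients of
    model g, with zeros for predictors outside its block.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Assumption: a random partition stands in for the learned split.
    blocks = np.array_split(rng.permutation(p), n_models)
    coefs = np.zeros((n_models, p))
    for g, idx in enumerate(blocks):
        # Ordinary least squares restricted to the predictors in block g.
        beta, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        coefs[g, idx] = beta
    return coefs


def ensemble_coefficients(coefs):
    """Average the member coefficients into one linear model."""
    # Averaging linear predictions is the same as predicting
    # with the averaged coefficient vector.
    return coefs.mean(axis=0)
```

Each row of the returned array can be read as a small model on its own subset of predictors, and the averaged vector gives the ensemble in the same form.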