## Stacking Ensemble: a quick review

Maxim Fedotov ’22 (Data Science Methodology)

## Intro

Stacking is an ensemble method that is used widely in supervised learning. As always, we have some training data, and the goal is to predict target variable in new data. The basic structure of the method consists in two levels of learners: base and meta. The main idea is that the meta learner combines predictions of the base learners to provide final response. That is, predictions of the base learners are used to calibrate the meta learner. The beauty of ensemble techniques in general is that they allow to capture various aspects of patterns in data using heterogeneous machine learning models. It explains why ensembles often happen to exhibit strong predictive power. For peak performance, it is recommended to have some degree of diversity in the base learners and sufficiently small correlation among their predictions.

Of course, at first sight, this algorithm might seem to be prone to overfitting. However, stacking is constructed in a way that helps to avoid it. In fact, it is the most subtle part of the technique. In short, while calibrating the meta learner, one uses cross-validation type of procedure to combine predictions of the base learners.

Along the text, I will use some pseudo-code notation which I will be defining on the go. Also, the text was prepared having in mind the stacking ensemble implementations from `scikit-learn` and `mlens` Python packages.

The purpose of this text is to describe meta learner calibration and base learners training separately to avoid any misinterpretations that could arise when mixing these components together.

I start with a short description of a basic supervised learning setup to define some notation. Then, I proceed to how the meta learner is trained. Finally, I explain how one obtains predictions for new data. Take a coffee, and let’s jump in!

## Setup

In this text, we consider a basic numerical data setup. That is, we have a target which we want to predict, and the feature matrix which carries information about some features of `n` observations.

Let’s quickly setup some notation:

1. Data: `X_train` (feature matrix), `y_train` (target).
2. Number of rows (observations) in the training data: `n`.
3. Cross-validation folds (disjoint): `cv_folds = [fold_1, …, fold_K]`, such that `union(cv_folds) = {1, … , n}`.
4. Base learners: `base_learners = [base_learner[1], ..., base_learner[B]]`.
5. Meta learner: `meta_learner`.
6. New data: `X_new`.

The learners here are generic objects that define a learning model, that can be fitted on data via applying a generic method `.fit()` and predict a target via method `.predict()`. Whenever the learners are trained, I mention it explicitly. With this setup, we proceed to describing how one trains the meta learner.

## Stacking Ensemble Training

Now, we are ready to discuss how we can train a stacking ensemble and then obtain predictions for new data.

### Meta learner calibration

One first fits the base learners on the training data using a cross-validation approach. Remember, our final goal here is to train the meta learner, which will be producing final target predictions. So, we do not want to propagate information about target realisation of the i-th observation into a base learner prediction for it since doing this will cause extreme overfitting of the meta learner. That is why we use a cross-validation type of procedure. Then, when we have predictions of the base learners obtained via the cross-validation approach, we can concatenate them with our initial feature matrix, i.e. considering them as new features for the meta learner. So, to train the meta learner one simply fits it to the target given the extended feature matrix. That is it!

Now, let’s summarise the procedure into a piece of pseudo-code:

Note that construction of `X_meta` may vary from one implementation to another. For example, one may choose a subset of features to use (propagate) in the meta-learner. Attention: do not confuse it with a concept of feature propagation that was introduced to cope with a problem of missing data in learning on graphs by Rossi, E., Kenlay, H., Gorinova, M.I., Chamberlain, B.P., Dong, X. and Bronstein, M.M. in the paper “On the unreasonable effectiveness of feature propagation in learning on graphs with missing node features.” Learning on Graphs Conference; PMLR, 2022.

As we can see, the structure is not complex. Of course, there are some technicalities that do not appear in this demonstrative pseudo-code. For example, one might want to do the cross-validation and base learners training in parallel to speed up performance. For a reference on the state-of-the-art implementation in Python, check out `scikit-learn` and `mlens` packages.

### Base learners training

I describe this step after meta learner calibration on purpose so that we do not confuse the former with a part of the latter. So far, we have calibrated the meta learner. However, to predict the target for new data we also need to train the base learners so they are able to produce base predictors for the unseen dataset.

The procedure is straightforward. One can just fit each base learner on the whole training dataset `X_train, y_train`. Again, when training the meta learner we were using the cross-validation approach to avoid excessive overfitting of the meta learner. Here, we do not have to do this since we will use the base learners to obtain base predictions for newly occurred data that was never seen by the model. So, we can utilize all the data that we have.

## Stacking Ensemble prediction

At this point, we have the meta learner calibrated and the base learners trained. So, to predict target values for newly occurred data, we first obtain the base target predictions, and then use the meta learner.

## Conclusion

Ensemble methods are particularly known for their decent prediction performance in supervised-learning setups if used appropriately. Stacking ensembles exhibit hierarchical structure with two levels: base and meta. Meta learner combines responses of base learners to provide final prediction of a target variable. To avoid overfitting, meta learner is trained involving cross-validation type of procedure used to obtain base-learner predictions.

I discuss how stacking ensembles can be trained and used for prediction in supervised learning problems. I decided to maintain some level of generality in the method description while providing pseudo-code examples.

Hope it helps!

## Connect with Maxim

Maxim Fedotov ’22 is an MRes student in Statistics at Universitat Pompeu Fabra. He is an alum of the BSE Master’s in Data Science Methodology.