## Stacking Ensemble: a quick review

Maxim Fedotov ’22 (Data Science Methodology)

## Intro

Stacking is an ensemble method widely used in supervised learning. As always, we have some training data, and the goal is to predict the target variable for new data. The method is built on two levels of learners: base and meta. The main idea is that the meta learner combines the predictions of the base learners to produce the final response. That is, the predictions of the base learners are used to calibrate the meta learner. The beauty of ensemble techniques in general is that they can capture different aspects of the patterns in the data by using heterogeneous machine learning models, which explains why ensembles often exhibit strong predictive power. For best performance, it is recommended that the base learners be reasonably diverse and that their predictions be only weakly correlated.

At first sight, this algorithm might seem prone to overfitting. However, stacking is constructed in a way that helps to avoid it, and this is in fact the most subtle part of the technique. In short, when calibrating the meta learner, one uses a cross-validation-type procedure to obtain the base learner predictions that the meta learner is fitted on.

Throughout the text, I use some pseudo-code notation, which I define as I go. The text was prepared with the stacking ensemble implementations from the `scikit-learn` and `mlens` Python packages in mind.

The purpose of this text is to describe meta learner calibration and base learners training separately to avoid any misinterpretations that could arise when mixing these components together.

I start with a short description of a basic supervised learning setup to define some notation. Then, I proceed to how the meta learner is calibrated and how the base learners are trained. Finally, I explain how one obtains predictions for new data. Grab a coffee, and let’s jump in!

## Setup

In this text, we consider a basic numerical data setup: we have a target that we want to predict and a feature matrix that holds the features of `n` observations.

Let’s quickly set up some notation:

1. Data: `X_train` (feature matrix), `y_train` (target).
2. Number of rows (observations) in the training data: `n`.
3. Cross-validation folds (disjoint): `cv_folds = [fold_1, …, fold_K]`, such that `union(cv_folds) = {1, … , n}`.
4. Base learners: `base_learners = [base_learner[1], ..., base_learner[B]]`.
5. Meta learner: `meta_learner`.
6. New data: `X_new`.
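As a small illustration of item 3, the folds can be built along the following lines. This is a minimal sketch, not taken from any particular package: the helper `make_cv_folds` and its `seed` argument are hypothetical names, and the row indices are 0-based (as is usual in Python) rather than running from 1 to `n`.

```python
import random


def make_cv_folds(n, K, seed=0):
    """Split the row indices 0..n-1 into K disjoint folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [sorted(indices[k::K]) for k in range(K)]


cv_folds = make_cv_folds(n=10, K=3)
```

Every row index ends up in exactly one fold, which is the disjointness and coverage property required above.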

The learners here are generic objects that define a learning model; each can be fitted to data via a generic `.fit()` method and can predict a target via a `.predict()` method. Whenever the learners are trained, I mention it explicitly. With this setup, we proceed to describe how one trains the meta learner.

## Stacking Ensemble Training

Now, we are ready to discuss how we can train a stacking ensemble and then obtain predictions for new data.

### Meta learner calibration

One first obtains predictions of the base learners on the training data using a cross-validation approach: for each fold, every base learner is fitted on the observations outside the fold and then predicts the target for the observations inside it. Remember, our final goal here is to train the meta learner, which will produce the final target predictions. We do not want information about the target realisation of the i-th observation to leak into a base learner’s prediction for that observation, since this would cause severe overfitting of the meta learner. That is why we use a cross-validation-type procedure. Once we have these out-of-fold predictions, we can concatenate them with the initial feature matrix, i.e. treat them as new features for the meta learner. To train the meta learner, one then simply fits it to the target given the extended feature matrix. That is it!

Now, let’s summarise the procedure into a piece of pseudo-code:
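The following is a minimal Python sketch of the calibration step using the notation from the setup. It is only an illustration under stated assumptions: `MeanLearner` and `NearestNeighbour` are toy stand-ins invented here for real models, `calibrate_meta_learner` is a hypothetical helper name, the folds hold 0-based row indices, and `X_meta` is built by appending the out-of-fold base predictions to all original features.

```python
import copy


class MeanLearner:
    """Toy learner: always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighbour:
    """Toy learner: predicts the target of the closest training row."""

    def fit(self, X, y):
        self.X_, self.y_ = [list(row) for row in X], list(y)
        return self

    def predict(self, X):
        def closest(row):
            dists = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X_]
            return self.y_[dists.index(min(dists))]
        return [closest(row) for row in X]


def calibrate_meta_learner(X_train, y_train, cv_folds, base_learners, meta_learner):
    n = len(X_train)
    # base_preds[i][b]: out-of-fold prediction of base learner b for row i.
    base_preds = [[None] * len(base_learners) for _ in range(n)]
    for fold in cv_folds:
        held_out = set(fold)
        rest = [i for i in range(n) if i not in held_out]
        X_rest = [X_train[i] for i in rest]
        y_rest = [y_train[i] for i in rest]
        X_fold = [X_train[i] for i in fold]
        for b, learner in enumerate(base_learners):
            # Fit a fresh copy so the held-out targets never reach the learner.
            fitted = copy.deepcopy(learner).fit(X_rest, y_rest)
            for i, pred in zip(fold, fitted.predict(X_fold)):
                base_preds[i][b] = pred
    # Extended feature matrix: original features plus base predictions.
    X_meta = [list(X_train[i]) + base_preds[i] for i in range(n)]
    meta_learner.fit(X_meta, y_train)
    return meta_learner, X_meta


X_train = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y_train = [0.1, 0.9, 2.1, 2.9, 4.2, 5.0]
cv_folds = [[0, 3], [1, 4], [2, 5]]  # 0-based row indices
base_learners = [MeanLearner(), NearestNeighbour()]
meta_learner, X_meta = calibrate_meta_learner(
    X_train, y_train, cv_folds, base_learners, NearestNeighbour()
)
```

Each row of `X_meta` holds the original feature followed by one column per base learner, and the prediction stored for a row always comes from a copy of the learner that never saw that row’s target.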

Note that the construction of `X_meta`, the extended feature matrix on which the meta learner is fitted, may vary from one implementation to another. For example, one may choose only a subset of the original features to use (propagate) in the meta learner. Attention: do not confuse this with the concept of feature propagation introduced by Rossi, E., Kenlay, H., Gorinova, M.I., Chamberlain, B.P., Dong, X. and Bronstein, M.M. to cope with missing data in learning on graphs, in the paper “On the unreasonable effectiveness of feature propagation in learning on graphs with missing node features.” Learning on Graphs Conference; PMLR, 2022.

As we can see, the structure is not complex. Of course, there are some technicalities that do not appear in this illustrative pseudo-code. For example, one might want to run the cross-validation and the base learner training in parallel to speed things up. For state-of-the-art implementations in Python, check out the `scikit-learn` and `mlens` packages.

### Base learners training

I describe this step after meta learner calibration on purpose so that we do not confuse the former with a part of the latter. So far, we have calibrated the meta learner. However, to predict the target for new data, we also need to train the base learners so that they are able to produce base predictions for the unseen dataset.

The procedure is straightforward: one simply fits each base learner on the whole training dataset `X_train, y_train`. When training the meta learner, we used the cross-validation approach to avoid overfitting the meta learner. Here, we do not have to do this, since the base learners will be used to produce base predictions for new data that the model has never seen. So, we can use all the data that we have.
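In code, this step is little more than a loop over the base learners. The sketch below is illustrative only: the two toy learners and the helper name `train_base_learners` are assumptions made for the example, not part of any library.

```python
import copy


class MeanLearner:
    """Toy learner: always predicts the mean of the training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighbour:
    """Toy learner: predicts the target of the closest training row."""

    def fit(self, X, y):
        self.X_, self.y_ = [list(row) for row in X], list(y)
        return self

    def predict(self, X):
        def closest(row):
            dists = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X_]
            return self.y_[dists.index(min(dists))]
        return [closest(row) for row in X]


def train_base_learners(X_train, y_train, base_learners):
    """Fit an independent copy of every base learner on all training rows."""
    return [copy.deepcopy(learner).fit(X_train, y_train) for learner in base_learners]


X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 2.0, 5.0]
fitted_base_learners = train_base_learners(
    X_train, y_train, [MeanLearner(), NearestNeighbour()]
)
```

No folds appear here: every base learner sees every training row, in contrast to the calibration step.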

## Stacking Ensemble prediction

At this point, we have the meta learner calibrated and the base learners trained. To predict target values for new data, we first obtain the base predictions for `X_new`, append them to the features in the same way as during calibration, and then apply the meta learner to this extended matrix.
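A minimal sketch of the prediction step follows. To keep it self-contained, it uses two hypothetical stand-ins for already fitted learners (`ConstantLearner` and `AverageOfBasePredictions`); in practice these would be the trained base learners and the calibrated meta learner. The helper name `stacking_predict` is likewise an assumption made for the example.

```python
class ConstantLearner:
    """Stand-in for a fitted base learner: always predicts `value`."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


class AverageOfBasePredictions:
    """Stand-in for a fitted meta learner: averages the last `n_base` columns."""

    def __init__(self, n_base):
        self.n_base = n_base

    def predict(self, X):
        return [sum(row[-self.n_base:]) / self.n_base for row in X]


def stacking_predict(X_new, fitted_base_learners, fitted_meta_learner):
    # One list of predictions per base learner, each of length len(X_new).
    base_preds = [learner.predict(X_new) for learner in fitted_base_learners]
    # Extend the new features exactly as the training features were extended.
    X_meta_new = [
        list(row) + [preds[i] for preds in base_preds]
        for i, row in enumerate(X_new)
    ]
    return fitted_meta_learner.predict(X_meta_new)


X_new = [[1.0], [2.0]]
y_pred = stacking_predict(
    X_new,
    [ConstantLearner(2.0), ConstantLearner(4.0)],
    AverageOfBasePredictions(n_base=2),
)
```

The column layout of the extended matrix must match the one used during calibration, otherwise the meta learner would read the wrong columns.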

## Conclusion

Ensemble methods are known for their strong prediction performance in supervised learning setups when used appropriately. Stacking ensembles have a hierarchical structure with two levels: base and meta. The meta learner combines the responses of the base learners to provide the final prediction of the target variable. To avoid overfitting, the meta learner is trained on base learner predictions obtained through a cross-validation-type procedure.

I have discussed how stacking ensembles can be trained and used for prediction in supervised learning problems. I decided to maintain some level of generality in the method description while providing pseudo-code examples.

Hope it helps!

## Connect with Maxim

Maxim Fedotov ’22 is an MRes student in Statistics at Universitat Pompeu Fabra. He is an alum of the BSE Master’s in Data Science Methodology.

## Individual recourse for Black Box Models

Explained intuitively by Patrick Altmeyer (Finance ’18, Data Science ’21) through a tale of cats and dogs

Is artificial intelligence (AI) trustworthy? If, like me, you have recently been gobsmacked by the Netflix documentary Coded Bias, then you were probably quick to answer that question with a definite “no”. The show documents the efforts of a group of researchers, led by Joy Buolamwini, who aim to inform the public about the dangers of AI.

One particular place where AI has already wreaked havoc is automated decision making. While automation is intended to free decision making processes from human biases and errors of judgment, it all too often simply encodes these flaws, which at times leads to systematic discrimination against individuals. In the eyes of Cathy O’Neil, another researcher appearing on Coded Bias, this is even more problematic than discrimination by human decision makers because “You cannot appeal to [algorithms]. They do not listen. Nor do they bend.” What Cathy is referring to here is the fact that individuals who are at the mercy of automated decision making systems usually lack the necessary means to challenge the outcome that the system has determined for them.

In my recent post on Towards Data Science, I look at a novel algorithmic solution to this problem. The post is based primarily on a paper by Joshi et al. (2019) in which the authors develop a simple, but ingenious idea: instead of concerning ourselves with interpretability of black-box decision making systems (DMS), how about just providing individuals with actionable recourse to revise undesirable outcomes? Suppose for example that you have been rejected from your dream job, because an automated DMS has decided that you do not meet the shortlisting criteria for the position. Instead of receiving a standard rejection email, would it not be more helpful to be provided with a tailored set of actions you can take in order to be more successful on your next attempt?

The methodology proposed by Joshi et al. (2019) and termed REVISE is an attempt to put this idea into practice. For my post I chose a more light-hearted topic than job rejections to illustrate the approach. In particular, I demonstrate how REVISE can be used to provide individual recourse to Kitty 🐱, a young cat that identifies as a dog. Based on information about her long tail and short overall height, a linear classifier has decided to label Kitty as a cat along with all the other cats that share similar attributes (Figure below). REVISE sends Kitty on the shortest possible route to being classified as a dog 🐶. She just needs to grow a few inches and fold up her tail (Figure below).

The following summary should give you some flavour of how the algorithm works:

1. Initialise x, that is, the attributes that will be revised iteratively. Kitty’s original attributes seem like a reasonable place to start.
2. Through gradient descent, iteratively revise x until g(x*)=🐶. At this point the descent terminates, since for these revised attributes the classifier labels Kitty as a dog.
3. Return x*-x, that is, the individual recourse for Kitty.
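The three steps can be illustrated with a few lines of Python. This is a simplified sketch of the summary above, not the REVISE implementation from Joshi et al. or from the original post: it assumes a logistic linear classifier with made-up weights `w` and `b` over the two attributes (height, tail length), made-up starting attributes for Kitty, and plain gradient descent on the log-loss for the label “dog”. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dog_probability(x, w, b):
    """Probability that the linear classifier assigns to the label 'dog'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def find_recourse(x, w, b, step=0.1, max_iter=1000):
    # Step 1: start from the original attributes.
    x_star = list(x)
    for _ in range(max_iter):
        p = dog_probability(x_star, w, b)
        # Step 2: stop as soon as the classifier labels x_star as a dog.
        if p > 0.5:
            break
        # Gradient of the log-loss for the target 'dog' w.r.t. x is (p - 1) * w.
        x_star = [xi - step * (p - 1.0) * wi for xi, wi in zip(x_star, w)]
    # Step 3: the recourse is the change in attributes.
    return [xs - xi for xs, xi in zip(x_star, x)], x_star


w, b = [1.0, -1.0], 0.0  # hypothetical weights: taller and shorter-tailed means 'dog'
kitty = [1.0, 3.0]       # hypothetical (height, tail length)
recourse, kitty_star = find_recourse(kitty, w, b)
```

With these assumed weights, the recourse has a positive height component and a negative tail component, which matches the advice to grow a little and fold up the tail.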

This illustrative example is of course a bit silly and should not detract from the fact that the potential real-world use cases of the algorithm are serious and reach many domains. The work by Joshi et al. adds to a growing body of literature that aims to make AI more trustworthy and transparent. This will be decisive in applications of AI to domains like Economics, Finance and Public Policy, where decision makers and individuals rightfully insist on model interpretability and explainability.

The article was featured on TDS’ Editor’s Picks and has been added to their Model Interpretability column. This link takes you straight to the publication. Readers with an appetite for technical details around the implementation of stochastic gradient descent and the REVISE algorithm in R may also want to have a look at the original publication on my personal blog.

## Connect with the author

Following his first Master’s at Barcelona GSE (Finance Program), Patrick Altmeyer worked as an economist for the Bank of England for two years. He is currently finishing up the Master’s in Data Science at Barcelona GSE.

Upon graduation Patrick will remain in academia to pursue a PhD in Trustworthy Artificial Intelligence at Delft University of Technology.

## #ICYMI on the BGSE Data Science blog: Randomized Numerical Linear Algebra (RandNLA) For Least Squares: A Brief Introduction

Dimensionality reduction is a topic that has governed our (the 2017 BGSE Data Science cohort) last three months. At the heart of topics such as penalized likelihood estimation (Lasso, Ridge, Elastic Net, etc.), principal component analysis and best subset selection lies the fundamental trade-off between complexity, generalizability and computational feasibility.

David Rossell taught us that even if we have found a methodology to compare across models, there is still the problem of enumerating all models to be compared… read the full post by Robert Lange ’17 on Barcelona GSE Data Scientists