Stacking Ensemble: a quick review

Maxim Fedotov ’22 (Data Science Methodology)

Stacking Ensemble diagram
Stacking Ensemble

Intro

Stacking is an ensemble method that is widely used in supervised learning. As always, we have some training data, and the goal is to predict the target variable for new data. The method is structured in two levels of learners: base and meta. The main idea is that the meta learner combines the predictions of the base learners into the final prediction. That is, the base learners' predictions are used to calibrate the meta learner. The beauty of ensemble techniques in general is that they can capture different aspects of the patterns in the data using heterogeneous machine learning models, which explains why ensembles often exhibit strong predictive power. For peak performance, it is recommended to have some degree of diversity among the base learners and sufficiently low correlation among their predictions.

Of course, at first sight, this algorithm might seem prone to overfitting. However, stacking is constructed in a way that helps to avoid it; in fact, this is the most subtle part of the technique. In short, while calibrating the meta learner, one uses a cross-validation-type procedure to obtain the base learners' predictions.

Throughout the text, I will use some pseudo-code notation, which I define as I go. The text was also prepared with the stacking ensemble implementations from the scikit-learn and mlens Python packages in mind.

The purpose of this text is to describe meta learner calibration and base learner training separately, to avoid the misinterpretations that can arise when the two components are mixed together.

I start with a short description of a basic supervised learning setup to define some notation. Then, I proceed to how the meta learner is trained. Finally, I explain how one obtains predictions for new data. Grab a coffee, and let's jump in!

Setup

In this text, we consider a basic numerical data setup. That is, we have a target that we want to predict, and a feature matrix that carries the features of n observations.

Let's quickly set up some notation:

  1. Data: X_train (feature matrix), y_train (target).
  2. Number of rows (observations) in the training data: n.
  3. Cross-validation folds (disjoint): cv_folds = [fold_1, …, fold_K], such that union(cv_folds) = {1, … , n}.
  4. Base learners: base_learners = [base_learner[1], ..., base_learner[B]].
  5. Meta learner: meta_learner.
  6. New data: X_new.

The learners here are generic objects that define a learning model: they can be fitted to data via a generic .fit() method and produce target predictions via a .predict() method. Whenever the learners are trained, I mention it explicitly. With this setup, we proceed to describing how one trains the meta learner.
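To fix ideas, the notation above might look as follows in Python. This is just a sketch: the data is a random placeholder, and any estimators exposing .fit() and .predict() could stand in for the learners.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Placeholder training data: n observations, p features.
n, p = 500, 10
rng = np.random.default_rng(0)
X_train = rng.normal(size=(n, p))
y_train = X_train @ rng.normal(size=p) + rng.normal(size=n)

# Disjoint cross-validation folds; each element stores a (train, validation)
# index pair, and the validation parts together cover {0, ..., n-1}.
cv_folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X_train))

# Generic base and meta learners exposing .fit() and .predict().
base_learners = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=3)]
meta_learner = Ridge(alpha=1.0)

# New data for which we eventually want predictions.
X_new = rng.normal(size=(20, p))
```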

Stacking Ensemble Training

Now, we are ready to discuss how we can train a stacking ensemble and then obtain predictions for new data.

Meta learner calibration

One first fits the base learners on the training data using a cross-validation approach. Remember, the final goal here is to train the meta learner, which will produce the final target predictions. We do not want information about the target realisation of the i-th observation to leak into a base learner's prediction for that observation, since this would cause severe overfitting of the meta learner; that is why a cross-validation-type procedure is used. Once we have the base learners' out-of-fold predictions, we concatenate them with the initial feature matrix, i.e. we treat them as new features for the meta learner. Training the meta learner then amounts to fitting it to the target given this extended feature matrix. That is it!

Now, let’s summarise the procedure into a piece of pseudo-code:

Algorithm for meta learner training
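A minimal Python version of the same logic, continuing the setup sketch above (real implementations such as scikit-learn's StackingRegressor or mlens handle this internally, with options for how the meta features are assembled):

```python
from sklearn.base import clone

B = len(base_learners)

# Out-of-fold predictions of the base learners: one column per base learner.
Z = np.zeros((n, B))

for train_idx, val_idx in cv_folds:
    for b, learner in enumerate(base_learners):
        fitted = clone(learner).fit(X_train[train_idx], y_train[train_idx])
        # The prediction for observation i never uses y_train[i] during fitting.
        Z[val_idx, b] = fitted.predict(X_train[val_idx])

# Extend the original feature matrix with the out-of-fold base predictions
# and fit the meta learner on the extended matrix.
X_meta = np.hstack([X_train, Z])
meta_learner.fit(X_meta, y_train)
```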

Note that the construction of X_meta may vary from one implementation to another. For example, one may choose only a subset of the original features to propagate to the meta learner. Attention: do not confuse this with the concept of feature propagation that was introduced to cope with the problem of missing data in learning on graphs by Rossi, E., Kenlay, H., Gorinova, M.I., Chamberlain, B.P., Dong, X. and Bronstein, M.M. in the paper "On the unreasonable effectiveness of feature propagation in learning on graphs with missing node features," Learning on Graphs Conference, PMLR, 2022.

As we can see, the structure is not complex. Of course, there are some technicalities that do not appear in this demonstrative pseudo-code. For example, one might want to run the cross-validation and base learner fits in parallel to speed things up. For a reference on state-of-the-art implementations in Python, check out the scikit-learn and mlens packages.


Base learners training

I describe this step after meta learner calibration on purpose, so that we do not confuse the former with being a part of the latter. So far, we have calibrated the meta learner. However, to predict the target for new data we also need to train the base learners so that they can produce base predictions for unseen data.

The procedure is straightforward: one simply fits each base learner on the whole training dataset X_train, y_train. Recall that when training the meta learner we used the cross-validation approach to avoid excessive overfitting of the meta learner. Here, this is not necessary, since the base learners will be used to obtain base predictions for new data that the model has never seen, so we can utilise all the data we have.
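In code, continuing the sketch above, this step is a single fit of each base learner on the full training set:

```python
# Refit every base learner on all of the training data; these fitted models
# will later produce the base predictions for unseen observations.
fitted_base_learners = [clone(learner).fit(X_train, y_train)
                        for learner in base_learners]
```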

Stacking Ensemble prediction

At this point, we have the meta learner calibrated and the base learners trained. So, to predict target values for new data, we first obtain the base target predictions and then feed them to the meta learner.

Algorithm for Stacking Ensemble prediction.
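In the notation of the sketches above, prediction reduces to a few lines:

```python
# Base predictions for the new data, one column per fitted base learner.
Z_new = np.column_stack([model.predict(X_new) for model in fitted_base_learners])

# Extend the new feature matrix in the same way as during training,
# then let the meta learner produce the final predictions.
X_new_meta = np.hstack([X_new, Z_new])
y_new_pred = meta_learner.predict(X_new_meta)
```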

Conclusion

Ensemble methods are particularly known for their decent predictive performance in supervised learning setups when used appropriately. Stacking ensembles have a hierarchical structure with two levels: base and meta. The meta learner combines the responses of the base learners to produce the final prediction of the target variable. To avoid overfitting, the meta learner is trained using a cross-validation-type procedure to obtain the base learners' predictions.

I have discussed how stacking ensembles can be trained and used for prediction in supervised learning problems, maintaining some level of generality in the description while providing pseudo-code examples.

Hope it helps!


Connect with Maxim

Portrait of Maxim Fedotov

Maxim Fedotov ’22 is an MRes student in Statistics at Universitat Pompeu Fabra. He is an alum of the BSE Master’s in Data Science Methodology.

Machine learning model to predict mental health crises from electronic health records

Publication in Nature Medicine by Roger Garriga ’17 and Javier Mas ’17 (Data Science) et al

Article cover in Nature Medicine

The use of machine learning in healthcare is still in its infancy. In this paper, we describe the project we did to predict psychotic episodes together with Birmingham’s psychiatric hospital. We hope to see these sorts of applications of ML in healthcare become the new standard in the future. The technology is ready, so it’s just a matter of getting it done!

Paper abstract

The timely identification of patients who are at risk of a mental health crisis can lead to improved outcomes and to the mitigation of burdens and costs. However, the high prevalence of mental health problems means that the manual review of complex patient records to make proactive care decisions is not feasible in practice. Therefore, we developed a machine learning model that uses electronic health records to continuously monitor patients for risk of a mental health crisis over a period of 28 days. The model achieves an area under the receiver operating characteristic curve of 0.797 and an area under the precision-recall curve of 0.159, predicting crises with a sensitivity of 58% at a specificity of 85%. A follow-up 6-month prospective study evaluated our algorithm’s use in clinical practice and observed predictions to be clinically valuable in terms of either managing caseloads or mitigating the risk of crisis in 64% of cases. To our knowledge, this study is the first to continuously predict the risk of a wide range of mental health crises and to explore the added value of such predictions in clinical practice.

(You can also read about the project in more detail in this article from UPF)

Citation

Garriga, R., Mas, J., Abraha, S. et al. Machine learning model to predict mental health crises from electronic health records. Nat Med (2022). https://doi.org/10.1038/s41591-022-01811-5

Connect with BSE authors

Roger Garriga ’17 is a Research Data Scientist at Koa Health. He is an alum of the BSE Master’s in Data Science.

Javier Mas ’17 is Lead Data Scientist at Kannact. He is an alum of the BSE Master’s in Data Science.

Understanding Latent Vector Arithmetic for Attribute Manipulation in Normalizing Flows

Data Science master project by Eduard Gimenez Funes ’21

Five portraits of the same man with different facial expressions

Editor’s note: This post is part of a series showcasing Barcelona School of Economics master projects. The project is a required component of all BSE Master’s programs.

Abstract

Normalizing flows are an elegant approximation to generative modelling. It can be shown that learning a probability distribution of a continuous variable X is equivalent to learning a mapping f from the domain where X is defined to R^n such that the final distribution is a Gaussian. In "Glow: Generative flow with invertible 1×1 convolutions," Kingma et al. introduced the Glow model. Normalizing flows arrange the latent space in such a way that feature additivity is possible, allowing synthetic image generation. For example, it is possible to take the image of a person not smiling, add a smile, and obtain the image of the same person smiling. Using the CelebA dataset we report new experimental properties of the latent space such as specular images and linear discrimination. Finally, we propose a mathematical framework that helps to understand why feature additivity works.
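To make the latent arithmetic concrete, here is a schematic Python sketch. The flow object and its encode/decode methods are hypothetical stand-ins for the forward and inverse passes of a trained model such as Glow, and the two mean latent vectors would be computed from smiling and non-smiling CelebA images.

```python
def add_smile(image, flow, z_smiling_mean, z_neutral_mean, alpha=1.0):
    """Attribute manipulation via latent vector arithmetic (schematic sketch).

    `flow` is assumed to expose encode/decode methods (the forward and inverse
    passes of a trained normalizing flow); the difference of the two mean
    latent vectors points along the 'smile' direction in latent space.
    """
    z = flow.encode(image)                            # image -> latent space
    smile_direction = z_smiling_mean - z_neutral_mean
    return flow.decode(z + alpha * smile_direction)   # latent -> image space
```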

Conclusions

Generative Models for Deep Fake generation sit in between Engineering, Mathematics and Art. Trial and error is key to finding solutions to these types of problems. Theoretical grounding might only come afterwards. But when it does, it is simply amazing. By experimenting with normalizing flows we found properties of the latent space that have helped us create a mathematical model that explains why feature additivity works.

Connect with the author

About the BSE Master’s Program in Data Science Methodology

Deep Vector Autoregression for Macroeconomic Data

Data Science master project by Marc Agustí, Patrick Altmeyer, and Ignacio Vidal-Quadras Costa ’21

Photo by Uriel SC on Unsplash

Editor’s note: This post is part of a series showcasing Barcelona School of Economics master projects. The project is a required component of all BSE Master’s programs.

Abstract

Vector autoregression (VAR) models are a popular choice for forecasting macroeconomic time series data. Due to their simplicity and success at modelling monetary economic indicators, VARs have become a standard tool for central bankers to construct economic forecasts. Impulse response functions can be readily retrieved and are used extensively to investigate the monetary transmission mechanism. In light of recent advancements in computational power and the development of advanced machine learning and deep learning algorithms, we propose a simple way to integrate these tools into the VAR framework.

This paper aims to contribute to the time series literature by introducing a ground-breaking methodology which we refer to as Deep Vector Autoregression (Deep VAR). By fitting each equation of the VAR system with a deep neural network, the Deep VAR outperforms the VAR in terms of in-sample fit, out-of-sample fit and point forecasting accuracy. In particular, we find that the Deep VAR is able to better capture the structural economic changes during periods of uncertainty and recession.
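As a rough illustration of the idea (not the authors' implementation), each equation of a VAR(p) can be replaced by a small feed-forward network that regresses one series on the lags of all series. A minimal sketch using scikit-learn's MLPRegressor:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_deep_var(Y, lags=2, hidden=(32, 32)):
    """Fit one MLP per equation of a VAR(lags) system (illustrative sketch).

    Y is a (T, k) array of k time series; each network regresses one series
    on the stacked lags of all k series.
    """
    T, k = Y.shape
    X = np.hstack([Y[lags - l:T - l] for l in range(1, lags + 1)])  # lagged regressors
    targets = Y[lags:]
    return [MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000,
                         random_state=0).fit(X, targets[:, j])
            for j in range(k)]

# Example with synthetic data: 3 series observed over 300 periods.
Y = np.cumsum(np.random.default_rng(0).normal(size=(300, 3)), axis=0)
equations = fit_deep_var(Y, lags=2)
```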

Conclusions

To assess the modelling performance of Deep VARs compared to linear VARs we investigate a sample of monthly US economic data in the period 1959-2021. In particular, we look at variables typically analysed in the context of the monetary transmission mechanism including output, inflation, interest rates and unemployment.

Our empirical findings show a consistent and significant improvement in modelling performance associated with Deep VARs. Specifically, our proposed Deep VAR produces much lower cumulative loss measures than the VAR over the entire period and for all of the analysed time series. The improvements in modelling performance are particularly striking during subsample periods of economic downturn and uncertainty. This appears to confirm our initial hypothesis that by modelling time series through Deep VARs it is possible to capture complex, non-linear dependencies that seem to characterize periods of structural economic change.

Chart shows the improvement in performance of the Deep VAR over the VAR model
Credit: the authors

When it comes to the out-of-sample performance, a priori it may seem that the Deep VAR is prone to overfitting, since it is much less parsimonious than the conventional VAR. On the contrary, we find that by using default hyperparameters the Deep VAR clearly dominates the conventional VAR in terms of out-of-sample prediction and forecast errors. An exercise in hyperparameter tuning shows that its out-of-sample performance can be further improved by appropriate regularization through adequate dropout rates and appropriate choices for the width and depth of the neural network. Interestingly, we also find that the Deep VAR actually benefits from very high lag order choices at which the conventional VAR is prone to overfitting.

In summary, we provide solid evidence that the introduction of deep learning into the VAR framework can be expected to lead to a significant boost in overall modelling performance. We therefore conclude that time series econometrics as an academic discipline can draw substantial benefits from further work on introducing machine learning and deep learning into its tool kit.

We also point out a number of shortcomings of our paper and proposed Deep VAR framework, which we believe can be alleviated through future research. Firstly, policy-makers are typically concerned with uncertainty quantification, inference and overall model interpretability. Future research on Deep VARs should therefore address the estimation of confidence intervals, impulse response functions as well as variance decompositions typically analysed in the context of VAR models. We point to a number of possible avenues, most notably Monte Carlo dropout and a Bayesian approach to modelling deep neural networks. Secondly, in our initial paper we benchmarked the Deep VAR only against the conventional VAR. In future work we will introduce other non-linear approaches to allow for a fairer comparison.

Code

To facilitate further research on Deep VAR, we also contribute a companion R package, deepvars, that can be installed from GitHub. We aim to continue working on the package as we develop our research further and ultimately want to move it onto CRAN. For any package-related questions, feel free to contact Patrick, who authored and maintains the package. There is also a paper-specific GitHub repository that uses the deepvars package.

Connect with the authors

About the BSE Master’s Program in Data Science Methodology

Individual recourse for Black Box Models

Explained intuitively by Patrick Altmeyer (Finance ’18, Data Science ’21) through a tale of cats and dogs

Is artificial intelligence (AI) trustworthy? If, like me, you have recently been gobsmacked by the Netflix documentary Coded Bias, then you were probably quick to answer that question with a definite “no”. The show documents the efforts of a group of researchers, headed by Joy Buolamwini, who aim to inform the public about the dangers of AI.

One particular place where AI has already wreaked havoc is automated decision making. While automation is intended to free decision-making processes from human biases and judgment error, it all too often simply encodes these flaws, which at times leads to systematic discrimination against individuals. In the eyes of Cathy O’Neil, another researcher appearing on Coded Bias, this is even more problematic than discrimination by human decision makers because “You cannot appeal to [algorithms]. They do not listen. Nor do they bend.” What Cathy is referring to here is the fact that individuals who are at the mercy of automated decision making systems usually lack the necessary means to challenge the outcome that the system has determined for them.

In my recent post on Towards Data Science, I look at a novel algorithmic solution to this problem. The post is based primarily on a paper by Joshi et al. (2019) in which the authors develop a simple but ingenious idea: instead of concerning ourselves with the interpretability of black-box decision making systems (DMS), how about just providing individuals with actionable recourse to revise undesirable outcomes? Suppose, for example, that you have been rejected from your dream job because an automated DMS has decided that you do not meet the shortlisting criteria for the position. Instead of receiving a standard rejection email, would it not be more helpful to be provided with a tailored set of actions you can take in order to be more successful on your next attempt?

The methodology proposed by Joshi et al. (2019) and termed REVISE is an attempt to put this idea into practice. For my post I chose a more light-hearted topic than job rejections to illustrate the approach. In particular, I demonstrate how REVISE can be used to provide individual recourse to Kitty 🐱, a young cat that identifies as a dog. Based on information about her long tail and short overall height, a linear classifier has decided to label Kitty as a cat along with all the other cats that share similar attributes (Figure below). REVISE sends Kitty on the shortest possible route to being classified as a dog 🐶 . She just needs to grow a few inches and fold up her tail (Figure below).

The following summary, together with the toy code sketch after the figure below, should give you some flavour of how the algorithm works:

  1. Initialise x, that is the attributes that will be revised recursively. Kitty’s original attributes seem like a reasonable place to start.
  2. Through gradient descent recursively revise x until g(x*)=🐶. At this point the descent terminates since for these revised attributes the classifier labels Kitty as a dog.
  3. Return x*-x, that is the individual recourse for Kitty.
Animation illustrates how Kitty crosses the decision boundary
The simplified REVISE algorithm in action: how Kitty crosses the decision boundary by changing her attributes. Regularisation with respect to the distance penalty increases from top left to bottom right. Image by author.
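Below is a toy Python sketch of these steps for a linear classifier. It is an illustration only, not the REVISE implementation from the paper: plain gradient descent on the log-loss of the “dog” label plus a penalty on the distance from Kitty’s original attributes.

```python
import numpy as np

def toy_revise(x0, w, b, lam=0.1, lr=0.1, max_iter=1000):
    """Toy individual recourse for the linear classifier sign(w @ x + b).

    Gradient descent on: log-loss of the target ('dog') label
    + lam * squared distance from the original attributes x0.
    Stops as soon as the label flips and returns the recourse x_star - x0.
    """
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        score = w @ x + b
        if score > 0:                       # the classifier now says 'dog'
            break
        # gradient of log(1 + exp(-score)) plus the distance penalty
        grad = -w / (1.0 + np.exp(score)) + 2 * lam * (x - x0)
        x -= lr * grad
    return x - x0                           # the individual recourse for Kitty

# Two attributes (tail length, height): 'dog' needs a shorter tail and more height.
w, b = np.array([-1.0, 2.0]), -1.0
kitty = np.array([3.0, 1.0])                # long tail, short height -> labelled 'cat'
print(toy_revise(kitty, w, b))              # how Kitty should change her attributes
```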

This illustrative example is of course a bit silly and should not detract from the fact that the potential real-world use cases of the algorithm are serious and reach many domains. The work by Joshi et al. adds to a growing body of literature that aims to make AI more trustworthy and transparent. This will be decisive in applications of AI to domains like Economics, Finance and Public Policy, where decision makers and individuals rightfully insist on model interpretability and explainability. 

Further reading

The article was featured on TDS’ Editor’s Picks and has been added to their Model Interpretability column. This link takes you straight to the publication. Readers with an appetite for technical details around the implementation of stochastic gradient descent and the REVISE algorithm in R may also want to have a look at the original publication on my personal blog.

Connect with the author

portrait


Following his first Master’s at Barcelona GSE (Finance Program), Patrick Altmeyer worked as an economist for the Bank of England for two years. He is currently finishing up the Master’s in Data Science at Barcelona GSE.

Upon graduation Patrick will remain in academia to pursue a PhD in Trustworthy Artificial Intelligence at Delft University of Technology.


How we used Bayesian models to balance customer experience and courier earnings at Glovo

Javier Mas Adell ’17 (Data Science)

Neon sign depicts Bayes' Theorem

Glovo is a three-sided marketplace composed of couriers, customers, and partners. Balancing the interests of all sides of our platform is at the core of most strategic decisions taken at Glovo. To balance those interests optimally, we need to understand quantitatively the relationship between the main KPIs that represent the interests of each side.

I recently published an article on Glovo’s Engineering blog where I explain how we used Bayesian modeling to help us tackle the modeling problems we were facing due to the inherent heterogeneity and volatility of Glovo’s operations. The example in the article talks about balancing interests on two of the three sides of our marketplace: the customer experience and courier earnings.

The skillset I developed during the Barcelona GSE Master’s in Data Science is what’s enabled me to do work like this that requires knowledge of machine learning and other fields like Bayesian statistics and optimization.

Connect with the author

portrait

Javier Mas Adell ’17 is Lead Data Scientist at Kannact. He is an alum of the Barcelona GSE Master’s in Data Science.

Stop dropping outliers, or you might miss the next Messi!

Jakob Poerschmann ’21 explains how to teach your regression the distinction between relevant outliers and irrelevant noise

Jakob Poerschmann ’21 (Data Science) has written an article called “Stop Dropping Outliers! 3 Upgrades That Prepare Your Linear Regression For The Real World” that was recently posted on Towards Data Science.

The real world example he uses to set up the piece will resonate with every fan of FC Barcelona (and probably scare them, too):

You are working as a Data Scientist for FC Barcelona and have taken on the task of building a model that predicts the value increase of young talent over the next 2, 5, and 10 years. You might want to regress the value on some meaningful metrics such as assists or goals scored. Some might now apply the standard procedure and drop the most severe outliers from the dataset. While your model might predict decently on average, it will unfortunately never understand what makes a Messi (because you dropped Messi with all the other “outliers”).

The idea of dropping or replacing outliers in regression problems comes from the fact that simple linear regression is comparatively sensitive to extreme values in the data. However, this approach would not have helped you much in your role as Barcelona’s Data Scientist. The simple message: outliers are not always bad!
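One common way to keep extreme points while limiting their pull on a plain least-squares fit (not necessarily one of the three upgrades discussed in the article) is a robust loss such as scikit-learn's HuberRegressor, which down-weights large residuals instead of letting them dominate. A small sketch with hypothetical data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

# Hypothetical data: player value grows with goals scored (true slope 0.8),
# except for a few exceptional performers whose value explodes.
rng = np.random.default_rng(42)
goals = rng.poisson(5, size=200).astype(float)
value = 0.8 * goals + rng.normal(0, 0.5, size=200)
value[np.argsort(goals)[-3:]] += 25          # three Messi-like outliers

X = goals.reshape(-1, 1)
ols = LinearRegression().fit(X, value)
huber = HuberRegressor().fit(X, value)       # keeps the outliers, limits their pull

# The OLS slope is dragged well above 0.8 by the three extreme points,
# while the Huber slope stays much closer to it.
print(ols.coef_, huber.coef_)
```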

Dig into the full article to find out how to prepare your linear regression for the real world and avoid a tragedy like this one!

Connect with the author

portrait

Jakob Poerschmann ’21 is a student in the Barcelona GSE Master’s in Data Science.

Data Science team “Non-Juvenile Asymptotics” wins 3rd prize in annual Novartis Datathon

Patrick Altmeyer, Eduard Gimenez, Simon Neumeyer and Jakob Poerschmann ’21 competed against 57 teams from 14 countries.

Screenshot of team members on videoconference
Members of the “Non-Juvenile Asymptotics” Eduard Gimenez, Patrick Altmeyer, Simon Neumeyer and Jakob Poerschmann, all Barcelona GSE Data Science Class of 2021

The Novartis Datathon is a Data Science competition that takes place annually. In 2020, the Barcelona GSE team “Non-Juvenile Asymptotics”, consisting of Eduard Gimenez, Patrick Altmeyer, Simon Neumeyer and Jakob Poerschmann, won third place after a fierce competition against 57 teams from 14 countries around the globe. While the competition is usually hosted in Barcelona, this Covid-friendly edition was fully remote. Nevertheless, the increased diversity of teams clearly made up for the in-person atmosphere that was missed.

This year’s challenge: predict the impact of generic drug market entry

The challenge of interest concerned predicting the impact of generic drug market entry. For pharmaceutical companies, the risk of losing ground to cheaper drug replicates once patent protection runs out is evident. The solutions developed help to solve exactly this problem, making drug development much easier to plan and calculate.

While the problem could have been tackled in various ways, the Barcelona GSE team focused on first developing a solid modelling framework. This represented a risky extra effort in the beginning; in fact, more than half of the competition period passed without any forecast submission by the Barcelona GSE team. However, the initial effort clearly paid off: as soon as the obstacle was overcome, the “Non-Juvenile Asymptotics” were able to benchmark multiple models at rocket speed.

Fierce competition until the very last minute

The competition was a head-to-head race until the very last minute. The Barcelona GSE team was still in first place minutes before the final deadline, but the predictions of two teams from Hungary and Spain ended up taking the lead by razor-thin margins.

Congratulations to the winners!!!

Group photo of the team outside the entrance of Universitat Pompeu Fabra
The team at Ciutadella Campus (UPF)

Connect with the team

Tracking the Economy Using FOMC Speech Transcripts

Data Science master project by Laura Battaglia and Maria Salunina ’20

Editor’s note: This post is part of a series showcasing Barcelona School of Economics master projects. The project is a required component of all BSE Master’s programs.

Abstract

In this study, we propose an approach for the extraction of a low-dimensional signal from a collection of text documents ordered over time. The proposed framework foresees the application of Latent Dirichlet Allocation (LDA) for obtaining a meaningful representation of documents as a mixture over a set of topics. Such representations can then be modeled via a Dynamic Linear Model (DLM) as noisy realisations of a limited number of latent factors that evolve with time. We apply this approach to Federal Open Market Committee (FOMC) speech transcripts for the period of the Greenspan presidency. This study serves as exploratory research for the investigation into how unstructured text data can be incorporated into economic modeling. In particular, our findings point to the fact that a meaningful state-of-the-world signal can be extracted from experts’ language, and pave the way for further exploration into the building of macroeconomic forecasting models, and more generally into the use of variation in language for learning about latent economic conditions.
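A rough sketch of such a pipeline in Python, with random placeholder documents standing in for the FOMC transcripts and a one-factor dynamic factor model from statsmodels standing in for the authors' DLM:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus: one document per meeting, in time order.
rng = np.random.default_rng(0)
vocab = ["inflation", "risk", "uncertainty", "growth", "employment",
         "rates", "expectations", "output", "productivity", "credit"]
docs = [" ".join(rng.choice(vocab, size=30)) for _ in range(80)]

# Step 1: LDA represents each document as a mixture over a set of topics.
counts = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=5, random_state=0).fit_transform(counts)

# Step 2: treat the topic shares as noisy realisations of one latent factor
# evolving over time. One share is dropped to avoid the sum-to-one collinearity.
dfm = sm.tsa.DynamicFactor(theta[:, :-1], k_factors=1, factor_order=1)
res = dfm.fit(disp=False)
latent_signal = res.factors.filtered[0]      # the extracted low-dimensional signal
```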

Key findings

In our paper, we develop a sequential approach for the extraction of a low-dimensional signal from a collection of documents ordered over time. We apply this framework to the US Fed’s FOMC speech transcripts for the period 08-1986 to 01-2006. We retrieve estimates of a single latent factor that seems to track fairly well a specific set of topics connected with risk, uncertainty, and expectations. Finally, we find a remarkable correspondence between this factor and the Economic Policy Uncertainty Indices for the United States.


Connect with the authors

About the BSE Master’s Program in Data Science Methodology

Structure and power dynamics in labour flow and company control networks in the UK

Data Science master project by Áron Pap ’20

Droplets of dew collect on a spider web
Photo by Nathan Dumlao on Unsplash

Editor’s note: This post is part of a series showcasing Barcelona School of Economics master projects. The project is a required component of all BSE Master’s programs.

Abstract

In this thesis project I analyse labour flow networks, considering both undirected and directed configurations, and company control networks in the UK. I observe that these networks exhibit characteristics that are typical of empirical networks, such as heavy-tailed degree distribution, strong, naturally emerging communities with geo-industrial clustering and high assortativity. I also document that distinguishing between the type of investors of firms can help to better understand their degree centrality in the company control network and that large institutional entities having significant and exclusive control in a firm seem to be responsible for emerging hubs in this network. I also devise a simple network formation model to study the underlying causal processes in this company control network.
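The descriptive statistics mentioned in the abstract can be computed with networkx. Here is a toy sketch on a synthetic graph standing in for the UK labour flow network; the numbers will of course differ from the empirical UK data.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy stand-in: nodes are firms, an edge means workers moved between two firms.
# A Barabasi-Albert graph gives a heavy-tailed degree distribution.
G = nx.barabasi_albert_graph(n=1000, m=3, seed=0)

degrees = sorted((d for _, d in G.degree()), reverse=True)
assortativity = nx.degree_assortativity_coefficient(G)
communities = greedy_modularity_communities(G)

print(degrees[:10], assortativity, len(communities))
```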


Conclusion and future research

Intriguing empirical patterns and a new stylized fact are documented in the study of the company control network: there is suggestive evidence that the types and number of investors are strongly associated with how “interconnected” a firm is in the company control network. Based on the empirical data, it also seems that the largest institutional investors mainly seek opportunities where they can have significant control without sharing it with other dominant players. Thus the most “interconnected”/central firms in the company control network are the ones that can maintain this power balance in their ownership structure.

The devised network formation model helps to better understand the potential underlying mechanisms behind the empirically observed stylized facts about the company control network. I carry out numerical simulations and sensitivity analysis, and also calibrate the parameters of the model using Bayesian optimization techniques to match the empirical results. However, these results could be further “fine-tuned” at several stages in order to achieve a better empirical fit. First, the network formation model could be enhanced to represent more complex agent interactions and decisions. In addition, the model calibration method could be extended to include more parameters and a larger valid search space for each of those parameters.

This project could also benefit from improvements to the data used. For example, more granular data on geographical regions could help in better understanding the different parts of London and in forming a more detailed view of economic hubs in the UK. Moreover, the current data source provides a static snapshot of the ownership and control structure of firms. Panel data on this front could enhance the analysis of the company control network; numerous experiments related to temporal dynamics could then be carried out, for example link prediction or testing whether investors follow some kind of “preferential attachment” rule when acquiring significant control in firms.

Connect with the author

Áron Pap, Visiting Student at The Alan Turing Institute

About the BSE Master’s Program in Data Science Methodology