Stop dropping outliers, or you might miss the next Messi!

Jakob Poerschmann ’21 explains how to teach your regression the distinction between relevant outliers and irrelevant noise

Jakob Poerschmann ’21 (Data Science) has written an article called “Stop Dropping Outliers! 3 Upgrades That Prepare Your Linear Regression For The Real World” that was recently posted on Towards Data Science.

The real-world example he uses to set up the piece will resonate with every fan of FC Barcelona (and probably scare them, too):

You are working as a Data Scientist for the FC Barcelona and took on the task of building a model that predicts the value increase of young talent over the next 2, 5, and 10 years. You might want to regress the value over some meaningful metrics such as the assists or goals scored. Some might now apply this standard procedure and drop the most severe outliers from the dataset. While your model might predict decently on average, it will unfortunately never understand what makes a Messi (because you dropped Messi with all the other “outliers”).

The idea of dropping or replacing outliers in regression problems comes from the fact that simple linear regression is comparably prone to extremes in the data. However, this approach would not have helped you much in your role as Barcelona’s Data Scientist. The simple message: Outliers are not always bad!
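To see why a plain least-squares fit reacts so strongly to extremes, here is a minimal sketch in pure Python. It is not taken from the article: the helper `ols_fit` and the toy numbers (goals per season versus market value) are hypothetical, chosen only to illustrate the sensitivity described above.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form simple linear regression: slope = cov(x, y) / var(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical toy data: five "typical" players whose value is exactly 2 * goals.
goals = [2, 4, 6, 8, 10]
value = [4, 8, 12, 16, 20]

# The same squad plus one made-up extreme player (the "Messi" of this toy set).
goals_with_star = goals + [50]
value_with_star = value + [400]

_, slope_typical = ols_fit(goals, value)
_, slope_all = ols_fit(goals_with_star, value_with_star)

print(f"slope without the star: {slope_typical:.2f}")
print(f"slope with the star:    {slope_all:.2f}")
```

A single extreme observation is enough to pull the fitted slope far away from the one that describes the other five players. Dropping that observation restores the "average" fit, but it also throws away the only data point that says anything about exceptional talent, which is the trade-off the article sets out to address.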

Dig into the full article to find out how to prepare your linear regression for the real world and avoid a tragedy like this one!

Jakob Poerschmann ’21 is a student in the Barcelona GSE Master’s in Data Science.

A Bayesian Search for the Needle in the Haystack

Master project by Timothée Stumpf-Fétizon. Barcelona GSE Master’s Degree in Data Science

Editor’s note: This post is part of a series showcasing Barcelona GSE master projects by students in the Class of 2015. The project is a required component of every master program.

Timothée Stumpf-Fétizon

Master’s Program:
Data Science

Paper Abstract:

I develop an extension to Monte Carlo methods that sample from large and complex model spaces. I assess the extension using a new and fully functional module for Bayesian model choice. In standard conditions, my extension leads to an increase of around 30 percent in sampling efficiency.

This is a work in progress, and there is no guarantee that the rule works better in all situations!

If you’re interested in using Bayesian model averaging (BMA) in practice, you can fork the software on my GitHub (working knowledge of Python required!).