Kaggle evaluation metrics used for regression problems
Posted on Sat 16 February 2019
• 7 min read
While crafting a machine learning model there is always a need to assess its performance. When trying multiple models or tuning hyperparameters it is useful to compare different approaches and choose the best one. The sklearn.metrics module provides a plethora of metrics suitable for distinct purposes.
In this series of posts I will discuss four groups of common machine learning tasks, each of which requires specific metrics:
- Regression – predict the value of one or more continuous variables, e.g. predict the stock price of a given asset or the temperature for the next day.
- Binary classification – assign a sample to one of two classes, e.g. classify an image as one containing a "cat" or a "dog".
- Multi-class classification – assign a sample to one of many classes, e.g. classify a news article into a category such as "sport", "politics", "economy", "popculture", ...
- Other
The Kaggle competitions give insight into the approach taken by the Kaggle team to select the best evaluation metric for a given task. There used to be a Kaggle wiki containing short definitions of the metrics used in Kaggle competitions, but it is not available anymore. In this post we will look closer at the first group and explain a few model evaluation metrics used in regression problems. These are the metrics discussed in this post:
- Absolute Error – AE
- Mean Absolute Error – MAE
- Weighted Mean Absolute Error – WMAE
- Pearson Correlation Coefficient
- Spearman's Rank Correlation
- Root Mean Squared Error – RMSE
- Root Mean Squared Logarithmic Error – RMSLE
- Mean Columnwise Root Mean Squared Error – MCRMSE
Absolute Error – AE
The sum of the absolute value of each individual error.
\(\mathrm{AE} = \sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - \hat{y}_i|\)
Where:
\(e_i\) – individual prediction error,
\(n\) – number of test samples,
\(y_i\) – actual variable value,
\(\hat{y}_i\) – predicted variable value.
Since AE is a sum rather than a mean, its value scales with the number of test samples, which can cause a notable difference between public and private leaderboard calculations. Another drawback of the Absolute Error metric is that a direct comparison between models predicting variables on different scales is not possible. E.g. when using a model for financial predictions of the S&P 500 index and using the same model to predict the value of Microsoft stock, we cannot compare their performance using this metric, since the units and ranges differ: the S&P 500 is expressed in points and a stock price in dollars. In this situation one can use a percentage error to bring the evaluation metric to a common scale.
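A minimal NumPy sketch of the metric (the helper name and the sample values are my own, chosen for illustration):

```python
import numpy as np

def absolute_error(y_true, y_pred):
    """AE: sum of absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum())

# Toy example: three predictions with errors 0.5, 0.0 and 2.0
print(absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # 2.5
```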
Exemplary competition using Absolute Error for model evaluation:
- Forecast Eurovision Voting – This competition requires contestants to forecast the voting for this year's Eurovision Song Contest in Norway on May 25th, 27th and 29th.
Mean Absolute Error – MAE
Mean of the absolute value of each individual error.
The mean absolute error (MAE) is a quantity used to measure how close forecasts or predictions are to the eventual outcomes. The mean absolute error is given by the formula:
\(\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\)
Where:
\(n\) – number of test samples,
\(y_i\) – actual variable value,
\(\hat{y}_i\) – predicted variable value.
See also the paper: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance.
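A quick NumPy sketch of MAE (scikit-learn ships the same metric as `sklearn.metrics.mean_absolute_error`; the values below are illustrative):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """MAE: mean of absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).mean())

# Same toy data as for AE: total error 2.5 over three samples
print(mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # ≈ 0.8333
```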
Five exemplary competitions using Mean Absolute Error for model evaluation:
- LANL Earthquake Prediction – Can you predict upcoming laboratory earthquakes?
- PUBG Finish Placement Prediction – Can you predict the battle royale finish of PUBG players?
- Allstate Claims Severity – How severe is an insurance claim?
- Loan Default Prediction – Imperial College London – Constructing an optimal portfolio of loans.
- Finding Elo – Predict a chess player's FIDE Elo rating from one game.
Weighted Mean Absolute Error – WMAE
Weighted average of absolute errors.
WMAE can be used as an evaluation tool for better assessing model performance with respect to the goals of the application. For example, when recommending books or movies, the accuracy of the predictions may vary between past and recent products. In this situation it is not reasonable to treat every error equally, so more stress should be put on recent items.
WMAE can also be useful as a diagnostic tool that, like a "magnifying lens", helps to identify the cases an algorithm has trouble with. The formula for calculating WMAE is:
\(\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i |y_i - \hat{y}_i|\)
where:
\(n\) – number of test samples,
\(w_i\) – weight for sample \(i\),
\(y_i\) – actual variable value,
\(\hat{y}_i\) – predicted variable value.
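A minimal sketch of the weighted variant, normalizing by the sum of the weights as in the formula above (weights and values are made up for illustration):

```python
import numpy as np

def weighted_mean_absolute_error(y_true, y_pred, weights):
    """WMAE: absolute errors averaged with per-sample weights."""
    y_true, y_pred, w = (np.asarray(a, dtype=float) for a in (y_true, y_pred, weights))
    return float((w * np.abs(y_true - y_pred)).sum() / w.sum())

# Errors are 1.0 and 0.5; the second sample is weighted 3x,
# pulling the score toward its smaller error: (1*1.0 + 3*0.5) / 4 = 0.625
print(weighted_mean_absolute_error([1.0, 2.0], [0.0, 2.5], [1.0, 3.0]))  # 0.625
```

With all weights equal, WMAE reduces to plain MAE.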
Two exemplary competitions using Weighted Mean Absolute Error for model evaluation:
- The Winton Stock Market Challenge – Join a multidisciplinary team of research scientists.
- Walmart Recruiting – Store Sales Forecasting – Use historical markdown data to predict store sales.
Pearson Correlation Coefficient
Covariance of the two variables divided by the product of their standard deviations.
It is the normalization of the covariance between the two variables that gives an interpretable score. The Pearson correlation coefficient can be used to summarize the strength of the linear relationship between two data samples. The formula for calculating the Pearson correlation coefficient is:
\(p = \frac{\mathrm{cov}(y, \hat{y})}{\mathrm{std}(y) \cdot \mathrm{std}(\hat{y})}\)
where:
\(\mathrm{cov}()\) – covariance function,
\(\mathrm{std}()\) – standard deviation,
\(y_i\) – actual variable value,
\(\hat{y}_i\) – predicted variable value,
\(p\) – Pearson correlation coefficient.
The use of the mean and standard deviation in the calculation requires the data samples to have a Gaussian or Gaussian-like distribution.
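A sketch of the formula in NumPy, using population (biased) covariance and standard deviation so the two estimators match; `scipy.stats.pearsonr` gives the same coefficient (sample values are illustrative):

```python
import numpy as np

def pearson(y_true, y_pred):
    """Pearson correlation: covariance normalized by both standard deviations."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    cov = ((y - y.mean()) * (p - p.mean())).mean()  # population covariance
    return float(cov / (y.std() * p.std()))

# A perfectly linear relation scores 1.0
print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))  # ≈ 1.0
```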
Exemplary competition using Pearson Correlation Coefficient for model evaluation:
- Merck Molecular Activity Challenge – Help develop safe and effective medicines by predicting molecular activity.
Spearman’s Rank Correlation
Covariance of the two variables converted to ranks divided by the product of the standard deviation of ranks for each variable.
Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables. The two variables being considered may have a nonGaussian distribution.
The Spearman's correlation coefficient can be used to summarize the nonlinear (monotonic) relation between two data samples. Raw scores \(y_i\) and \(\hat{y}_i\) are converted to ranks \(ry_i\) and \(\hat{ry}_i\) respectively. The formula for calculating Spearman's rank correlation coefficient is:
\(r = \frac{\mathrm{cov}(ry, \hat{ry})}{\mathrm{std}(ry) \cdot \mathrm{std}(\hat{ry})}\)
where:
\(\mathrm{cov}()\) – covariance function,
\(\mathrm{std}()\) – standard deviation,
\(ry_i\) – rank of the actual variable value,
\(\hat{ry}_i\) – rank of the predicted variable value,
\(r\) – Spearman's correlation coefficient.
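A sketch of rank-then-correlate; the double `argsort` trick assumes no tied values (ties would need average ranks, as `scipy.stats.rankdata` provides). The data below is illustrative:

```python
import numpy as np

def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation computed on ranks.

    Assumes no tied values in either array.
    """
    def to_ranks(a):
        # Position of each element in the sorted order (0-based ranks)
        return np.argsort(np.argsort(a)).astype(float)

    ry = to_ranks(np.asarray(y_true, dtype=float))
    rp = to_ranks(np.asarray(y_pred, dtype=float))
    cov = ((ry - ry.mean()) * (rp - rp.mean())).mean()
    return float(cov / (ry.std() * rp.std()))

# A perfectly monotonic but nonlinear relation still scores 1.0
print(spearman([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0]))  # ≈ 1.0
```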
Exemplary competition using Spearman’s Rank Correlation for model evaluation:
- [Draper Satellite Image Chronology](https://www.kaggle.com/c/drapersatelliteimagechronology#evaluation) – Can you put order to space and time?
Root Mean Squared Error – RMSE
The square root of the mean of the squared errors.
The use of RMSE is very common and it makes an excellent general-purpose error metric for numerical predictions. Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors. The formula for calculating RMSE is:
\(\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\)
where:
\(n\) – number of test samples,
\(y_i\) – actual variable value,
\(\hat{y}_i\) – predicted variable value.
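A minimal NumPy sketch (illustrative values):

```python
import numpy as np

def rmse(y_true, y_pred):
    """RMSE: square root of the mean of squared errors."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(((y - p) ** 2).mean()))

# Errors of 3 and 4 give sqrt((9 + 16) / 2); note the larger error
# dominates the score more than it would under MAE
print(rmse([0.0, 0.0], [3.0, 4.0]))  # ≈ 3.5355
```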
Five exemplary competitions using Root Mean Squared Error for model evaluation:
- Elo Merchant Category Recommendation – Help understand customer loyalty.
- Google Analytics Customer Revenue Prediction – Predict how much GStore customers will spend.
- House Prices: Advanced Regression Techniques – Predict sales prices and practice feature engineering, RFs, and gradient boosting.
- Predict Future Sales – Final project for "How to win a data science competition" Coursera course.
- New York City Taxi Fare Prediction – Can you predict a rider's taxi fare?
Root Mean Squared Logarithmic Error – RMSLE
Root mean squared error of the variables transformed to a logarithmic scale.
\(\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}\)
Where:
\(n\) – number of test samples,
\(\hat{y}_i\) – predicted variable value,
\(y_i\) – actual variable value,
\(\log(x)\) – natural logarithm of \(x\).
The RMSLE is higher when the discrepancies between predicted and actual values are larger. Compared to Root Mean Squared Error (RMSE), however, RMSLE does not heavily penalize huge discrepancies when both values are huge: in such cases only the percentage difference matters, since a difference of logarithms is equivalent to the logarithm of the ratio of the variables.
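A minimal sketch using `np.log1p` for the \(\log(x + 1)\) transform; the two prediction pairs below (made up for illustration) have the same 2x ratio at very different magnitudes and receive nearly the same penalty, whereas RMSE would punish the larger pair far more:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSLE: RMSE computed on log(1 + x)-transformed values."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(((np.log1p(p) - np.log1p(y)) ** 2).mean()))

# Same 2x over-prediction at small and large scale
print(round(rmsle([10.0], [20.0]), 3), round(rmsle([1000.0], [2000.0]), 3))
```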
Five exemplary competitions using Root Mean Squared Logarithmic Error for model evaluation:
- Santander Value Prediction Challenge – Predict the value of transactions for potential customers.
- Mercari Price Suggestion Challenge – Can you automatically suggest product prices to online sellers?
- Recruit Restaurant Visitor Forecasting – Predict how many future visitors a restaurant will receive.
- New York City Taxi Trip Duration – Share code and data to improve ride time predictions.
- Sberbank Russian Housing Market – Can you predict realty price fluctuations in Russia's volatile economy?
Mean Columnwise Root Mean Squared Error – MCRMSE
The RMSE computed separately for each of the \(m\) predicted target variables (columns) over the \(n\) test samples, then averaged across the columns:
\(\mathrm{MCRMSE} = \frac{1}{m} \sum_{j=1}^{m} \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{ij} - \hat{y}_{ij})^2}\)
Note that the expression under the square root is the RMSE of the \(j\)-th column, thus we can write:
\(\mathrm{MCRMSE} = \frac{1}{m} \sum_{j=1}^{m} \mathrm{RMSE}_j\)
Where:
\(m\) – number of predicted variables,
\(n\) – number of test samples,
\(y_{ij}\) – \(i\)-th actual value of the \(j\)-th variable,
\(\hat{y}_{ij}\) – \(i\)-th predicted value of the \(j\)-th variable.
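A sketch for the multi-target case, reducing per column first and then across columns (the tiny 2x2 arrays are made up for illustration):

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """MCRMSE: RMSE of each column (target variable), averaged over columns."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    per_column_rmse = np.sqrt(((y - p) ** 2).mean(axis=0))  # one RMSE per target
    return float(per_column_rmse.mean())

# Two samples, two targets: column RMSEs are 1.0 and 2.0, so MCRMSE is 1.5
y_true = [[0.0, 0.0], [0.0, 0.0]]
y_pred = [[1.0, 2.0], [1.0, 2.0]]
print(mcrmse(y_true, y_pred))  # 1.5
```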
Exemplary competition using Mean Columnwise Root Mean Squared Error for model evaluation:
- Africa Soil Property Prediction Challenge – Predict physical and chemical properties of soil using spectral measurements.