Kaggle evaluation metrics used for regression problems

Posted on Sat 16 February 2019   •   7 min read

When building a machine learning model there is always a need to assess its performance. When trying multiple models or tuning hyperparameters, it is useful to compare different approaches and choose the best one. The sklearn.metrics module provides a plethora of metrics suitable for distinct purposes.

In this series of posts I will discuss four groups of common machine learning tasks, each of which requires specific metrics:

  1. Regression - predict the value of one or more continuous variables, e.g. predict the stock price of a given asset or the temperature for the next day.
  2. Binary classification - assign a sample to one of two classes, e.g. classify an image as one containing "cat" or "dog".
  3. Multiple class classification - assign a sample to one of many classes, e.g. classify a news article into the category "sport", "politics", "economy", "pop-culture", ...
  4. Other

Kaggle competitions give insight into the approach taken by the Kaggle team to select the best evaluation metric for a given task. There used to be a Kaggle wiki containing short definitions of the metrics used in Kaggle competitions, but it is not available anymore. In this post we will look closer at the first group and explain a few model evaluation metrics used in regression problems. The sections below cover each metric discussed in this post.

Absolute Error - AE

The sum of the absolute value of each individual error.

$$ \mathrm{AE} = \sum_{i=1}^n | y_i - \hat{y}_i | $$

Where:

\(|e_i| = |y_i-\hat{y}_i|\) - absolute error of the \(i\)-th sample,

\(n\) - number of test samples,

\(y_i\) - actual variable value,

\(\hat{y}_i\) - predicted variable value.
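The AE formula can be sketched in a few lines of plain Python. This is only an illustration: the function name `absolute_error` and the sample values are assumptions made for this example, not part of any library or competition.

```python
def absolute_error(y_true, y_pred):
    """Sum of absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred))


# Made-up values, used only to illustrate the formula.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Individual absolute errors are 1, 0, 2 and 1.
ae = absolute_error(y_true, y_pred)
print(ae)
```

Because AE is a plain sum, its value grows as more samples are added, even if the typical error per sample stays the same.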

MAE can cause a notable difference between public and private leaderboard calculations. One drawback of the Absolute Error metric is that it cannot be compared directly across models that predict variables on different scales. For example, when the same model is used to predict the S&P 500 index and the Microsoft stock price, we cannot compare its performance on the two tasks using this metric, since the units and ranges are different: the S&P 500 is expressed in points, while the stock price is expressed in dollars. In this situation one can use a percentage error to get an evaluation metric on a common scale.
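The scale problem can be illustrated with a small sketch. It assumes the "percentage error" is the mean absolute percentage error, which is only one possible choice; the function names and all numbers are made up for this example (they are not real index or stock quotes). Every prediction is off by exactly 1%.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute error, expressed in the units of the target."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the actual value, in percent.

    Assumes that no actual value is zero.
    """
    ratios = [abs((y - p) / y) for y, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical index values (points) and stock prices (dollars).
index_true = [2700.0, 2750.0, 2800.0]
index_pred = [2673.0, 2777.5, 2772.0]
stock_true = [100.0, 110.0, 105.0]
stock_pred = [99.0, 111.1, 103.95]

index_mae = mean_absolute_error(index_true, index_pred)
stock_mae = mean_absolute_error(stock_true, stock_pred)
index_mape = mean_absolute_percentage_error(index_true, index_pred)
stock_mape = mean_absolute_percentage_error(stock_true, stock_pred)

print(index_mae, stock_mae)    # very different magnitudes
print(index_mape, stock_mape)  # both close to 1 percent
```

The absolute metric differs by more than an order of magnitude between the two series, while the percentage metric is nearly identical, which is what makes the latter comparable across scales.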

Exemplary competition using Absolute Error for model evaluation:

  • Forecast Eurovision Voting - This competition requires contestants to forecast the voting for this year's Eurovision Song Contest in Norway on May 25th, 27th and 29th.

Mean Absolute Error - MAE

Mean of the absolute value of each individual error.

The mean absolute error (MAE) is a quantity used to measure how close forecasts or predictions are to the eventual outcomes. The mean absolute error is given by the formula:

$$ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^n \left| y_i - \hat{y}_i \right| = \frac{1}{n}\sum_{i=1}^n \left| e_i \right| $$

Where:

\(n\) - number of test samples,

\(y_i\) - actual variable value,

\(\hat{y}_i\) - predicted variable value.
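A minimal sketch of the MAE formula in plain Python follows. The function name and sample values are assumptions made for illustration; in practice one would more likely call `sklearn.metrics.mean_absolute_error`.

```python
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return fmean([abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)])


# Made-up values, used only to illustrate the formula.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Absolute errors are 1, 0, 2 and 1, so their mean is 4 / 4.
mae = mean_absolute_error(y_true, y_pred)
print(mae)
```

Unlike the summed Absolute Error, the result stays in the units of the target and does not grow with the number of samples.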

See also the paper: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance

Five exemplary competitions using Mean Absolute Error for model evaluation:

Weighted Mean Absolute Error - WMAE

Weighted average of absolute errors.

WMAE can be used as an evaluation tool for better assessing model performance with respect to the goals of the application. For example, when recommending books or movies, the accuracy of the predictions may differ between older and more recent products. In this situation it is not reasonable to treat every error equally, so more weight should be put on recent items.

WMAE can also be useful as a diagnostic tool that, like a "magnifying lens", helps to identify the cases an algorithm is having trouble with. The formula for calculating WMAE is:

$$ \textrm{WMAE} = \frac{1}{n} \sum_{i=1}^n w_i | y_i - \hat{y}_i |, $$

where:

\(n\) - number of test samples,

\(w_i\) - weights for sample \(i\),

\(y_i\) - actual variable value,

\(\hat{y}_i\) - predicted variable value.
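The sketch below implements the formula exactly as written above, dividing by \(n\). Some definitions of a weighted average divide by the sum of the weights instead, so the hypothetical `normalize_by_weights` flag is added here to show both variants. The function name, the flag and the sample values are all assumptions made for this example.

```python
def weighted_mean_absolute_error(y_true, y_pred, weights,
                                 normalize_by_weights=False):
    """Weighted mean absolute error.

    By default the weighted sum is divided by the number of samples,
    as in the formula above. With normalize_by_weights=True it is
    divided by the sum of the weights instead.
    """
    if not (len(y_true) == len(y_pred) == len(weights)):
        raise ValueError("all inputs must have the same length")
    weighted_sum = sum(
        w * abs(y - p) for y, p, w in zip(y_true, y_pred, weights)
    )
    denominator = sum(weights) if normalize_by_weights else len(y_true)
    return weighted_sum / denominator


# Made-up values: the last two samples are treated as five times
# more important than the first two.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]
weights = [1.0, 1.0, 5.0, 5.0]

wmae_by_n = weighted_mean_absolute_error(y_true, y_pred, weights)
wmae_by_w = weighted_mean_absolute_error(
    y_true, y_pred, weights, normalize_by_weights=True
)
print(wmae_by_n, wmae_by_w)
```

When comparing scores between sources, it is worth checking which normalization was used, since the two variants differ whenever the weights do not sum to \(n\).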

Two exemplary competitions using Weighted Mean Absolute Error for model evaluation:

Pearson Correlation Coefficient

Covariance of the two variables divided by the product of the standard deviation of each data sample.

It is the normalization of the covariance between the two variables to give an interpretable score. The Pearson correlation coefficient can be used to summarize the strength of the linear relationship between two data samples. The formula for calculating Pearson correlation coefficient is:

$$ p = \frac{cov(y_i, \hat{y}_i)}{std(y_i) std(\hat{y}_i)} $$

where:

\(cov()\) - is the covariance function,

\(std()\) - is the standard deviation,

\(y_i\) - actual variable value,

\(\hat{y}_i\) - predicted variable value,

\(p\) - Pearson correlation coefficient.

Because the calculation relies on the mean and standard deviation, the coefficient is most informative when both data samples have a Gaussian or Gaussian-like distribution.
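The Pearson formula can be sketched in plain Python as follows. The function name and data are illustrative assumptions; `scipy.stats.pearsonr` is the usual library routine. The \(1/n\) factors in the covariance and standard deviations cancel, so the sketch works with raw sums.

```python
from math import sqrt
from statistics import fmean


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant,
    because its standard deviation is then zero.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up values: predictions that are a perfect linear function
# of the actual values, first increasing and then decreasing.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [3.0, 5.0, 7.0, 9.0, 11.0]
falling = [11.0, 9.0, 7.0, 5.0, 3.0]

p_rising = pearson_correlation(y_true, rising)
p_falling = pearson_correlation(y_true, falling)
print(p_rising, p_falling)
```

Note that the predictions here are systematically wrong (they equal \(2y + 1\)), yet the correlation is perfect: the coefficient measures linear association, not the size of the errors.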

Exemplary competition using Pearson Correlation Coefficient for model evaluation:

Spearman’s Rank Correlation

Covariance of the two variables converted to ranks divided by the product of the standard deviation of ranks for each variable.

Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables. The two variables being considered may have a non-Gaussian distribution.

Spearman's correlation coefficient can be used to summarize the strength of a monotonic, possibly nonlinear, relationship between the two data samples. Raw scores \(y_i\) and \(\hat{y}_i\) are converted to ranks \(ry_i\) and \(\hat{ry}_i\), respectively. The formula for calculating Spearman's rank correlation coefficient is:

$$ r=\frac{cov(ry_i, \hat{ry}_i)}{std(ry_i)std(\hat{ry}_i)} $$

where:

\(cov()\) - is the covariance function,

\(std()\) - is the standard deviation,

\(ry_i\) - rank of actual variable value,

\(\hat{ry}_i\) - rank of predicted variable value,

\(r\) - Spearman's correlation coefficient.
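The two steps above, ranking and then correlating the ranks, can be sketched in plain Python. The helper names are hypothetical and ties are given their average rank, which is a common convention but an assumption here; `scipy.stats.spearmanr` is the usual library routine.

```python
from math import sqrt
from statistics import fmean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the group of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson_correlation(x, y):
    """Pearson correlation; fails on constant input (zero variance)."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_correlation(y_true, y_pred):
    """Pearson correlation computed on the ranks of both samples."""
    return pearson_correlation(average_ranks(y_true), average_ranks(y_pred))


# Made-up values: a cubic, hence nonlinear but monotonic, relationship.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 8.0, 27.0, 64.0, 125.0]

r = spearman_correlation(y_true, y_pred)
tied = average_ranks([10.0, 20.0, 20.0, 30.0])
print(r, tied)
```

The coefficient is 1 for this data because the ordering of the predictions matches the ordering of the actual values exactly, even though the relationship is far from linear.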

Exemplary competition using Spearman’s Rank Correlation for model evaluation:

  • Draper Satellite Image Chronology (https://www.kaggle.com/c/draper-satellite-image-chronology#evaluation) - Can you put order to space and time?

Root Mean Squared Error - RMSE

The square root of the mean of the squared errors.

RMSE is very common and makes an excellent general-purpose error metric for numerical predictions. Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors, because each error is squared before averaging. The formula for calculating RMSE is:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

where:

\(n\) - number of test samples,

\(y_i\) - actual variable value,

\(\hat{y}_i\) - predicted variable value.
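The following sketch shows how RMSE punishes large errors compared to MAE. The function names and the two sets of predictions are made up for this illustration: both sets have the same total absolute error, but one spreads it evenly and the other concentrates it in a single sample.

```python
from math import sqrt


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute errors, used here only for comparison."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0.0, 0.0, 0.0, 0.0]
spread_out = [2.0, 2.0, 2.0, 2.0]    # four errors of size 2
concentrated = [0.0, 0.0, 0.0, 8.0]  # a single error of size 8

mae_spread = mean_absolute_error(y_true, spread_out)
mae_conc = mean_absolute_error(y_true, concentrated)
rmse_spread = root_mean_squared_error(y_true, spread_out)
rmse_conc = root_mean_squared_error(y_true, concentrated)

print(mae_spread, mae_conc)    # identical
print(rmse_spread, rmse_conc)  # the concentrated error scores worse
```

MAE cannot tell the two sets of predictions apart, while RMSE doubles for the set containing the single large error.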

Five exemplary competitions using Root Mean Squared Error for model evaluation:

Root Mean Squared Logarithmic Error - RMSLE

Root mean squared error of variables transformed to logarithmic scale.

$$ \mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2} $$

Where:

\(n\) - number of test samples,

\(\hat{y}_i\) - predicted variable value,

\(y_i\) - actual variable value,

\(\log(x)\) - natural logarithm of \(x\).

The RMSLE is higher when the discrepancies between predicted and actual values are larger. Compared to Root Mean Squared Error (RMSE), RMSLE does not heavily penalize huge discrepancies between the predicted and actual values when both values are huge. In this case only the percentage differences matter, since a difference of logarithms is equivalent to the logarithm of a ratio.
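A short sketch makes the relative nature of RMSLE visible. The function name `rmsle` and the values are assumptions chosen for illustration: the same absolute error of 100 is scored once on a small target and once on a large one.

```python
from math import log, log1p, sqrt


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error.

    log1p(x) computes log(x + 1); all values must be greater than -1.
    """
    squared = [(log1p(p) - log1p(y)) ** 2 for y, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


# Same absolute error (100) on very different scales.
small_scale = rmsle([100.0], [200.0])
large_scale = rmsle([100000.0], [100100.0])

print(small_scale, large_scale)
```

Doubling a value of 100 gives a much larger RMSLE than being off by 100 on a value of 100000, even though RMSE would score both cases identically.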

Exemplary competition using Root Mean Squared Logarithmic Error for model evaluation:

Mean Columnwise Root Mean Squared Error - MCRMSE

The average of the RMSE values computed separately for each of \(m\) target variables over \(n\) test samples.

$$ \mathrm{MCRMSE} = \frac{1}{m}\sum_{j=1}^{m}\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{ij}-\hat{y}_{ij})^2} $$

Note that each square-root term is the RMSE of the \(j\)-th variable, thus we can write:

$$ \mathrm{MCRMSE} = \frac{1}{m}\sum_{j=1}^{m}\mathrm{RMSE}_j $$

Where:

\(m\) - number of predicted variables,

\(n\) - number of test samples,

\(y_{ij}\) - \(i\)-th actual value of \(j\)-th variable,

\(\hat{y}_{ij}\) - \(i\)-th predicted value of \(j\)-th variable.
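The column-wise averaging can be sketched in plain Python, assuming the data is stored as a list of rows (one row per sample, one column per target variable). The function name `mcrmse` and the values are made up for this example.

```python
from math import sqrt


def mcrmse(y_true, y_pred):
    """Mean of the per-column RMSE values.

    Both arguments are lists of rows; each row holds one value
    per target variable.
    """
    n_columns = len(y_true[0])
    column_rmse = []
    for j in range(n_columns):
        squared = [
            (row_true[j] - row_pred[j]) ** 2
            for row_true, row_pred in zip(y_true, y_pred)
        ]
        column_rmse.append(sqrt(sum(squared) / len(squared)))
    return sum(column_rmse) / n_columns


# Three samples, two target variables. The first column is always
# off by 1, the second column is predicted perfectly.
y_true = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
y_pred = [[2.0, 10.0], [3.0, 20.0], [4.0, 30.0]]

score = mcrmse(y_true, y_pred)
print(score)
```

The per-column RMSE values are 1 and 0, so the score is their mean. Each target variable contributes equally to the result regardless of its scale, so a column with large values can dominate unless the targets are scaled beforehand.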

Exemplary competition using Mean Columnwise Root Mean Squared Error for model evaluation:

References

  1. Kaggle wiki
  2. Beating Kaggle the easy way, page 43
  3. How to Use Correlation to Understand the Relationship Between Variables
  4. Mean Columnwise Root Mean Squared Error - google books
  5. Metrics to Understand Regression Models in Plain English: Part 1