How to choose model evaluation metrics

For my students, I put together the guide below, with the help of ChatGPT, on how to choose model evaluation metrics. Each metric is useful depending on what you’re trying to achieve with the model. For instance, recall is more important in medical diagnosis, while precision is more important in fraud detection.

Regression Model Evaluation Metrics:

Regression models predict continuous values (like predicting house prices).

Mean Absolute Error (MAE):

The average of the absolute differences between the predicted and actual values.

Example use case: If you're predicting housing prices, MAE gives a straightforward number showing the average price difference between the predicted price and the actual price. It's best used when you want to understand the overall error in simple dollar terms.
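
As a quick sketch (the prices below are made-up, illustrative numbers), MAE can be computed with scikit-learn:

from sklearn.metrics import mean_absolute_error

# Illustrative house prices in dollars (made-up values)
actual_prices = [300_000, 250_000, 400_000]
predicted_prices = [310_000, 240_000, 420_000]

mae = mean_absolute_error(actual_prices, predicted_prices)
print(mae)  # ~13,333 -> off by about $13,333 on average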

Mean Squared Error (MSE):

The average of the squared differences between the predicted and actual values. Squaring gives more weight to larger errors.

Example use case: If you're predicting electricity consumption, where large prediction errors (underestimating or overestimating) can be costly, MSE helps highlight those large errors. It's best used when large errors are more problematic.
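
A minimal sketch with made-up consumption figures, showing how the two large misses dominate the score:

from sklearn.metrics import mean_squared_error

# Illustrative daily electricity consumption in kWh (made-up values)
actual_kwh = [30, 28, 35, 40]
predicted_kwh = [32, 27, 30, 45]

mse = mean_squared_error(actual_kwh, predicted_kwh)
print(mse)  # 13.75 -- the two 5 kWh misses contribute 50 of the 55 total squared error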

Root Mean Squared Error (RMSE):

The square root of MSE. It brings the unit back to the original scale of the prediction.

Example use case: Similar to MSE but easier to interpret because it's in the same units as the predicted value. Best used for comparing models where you want to understand how far off, on average, your predictions are in practical terms (e.g., predicting salaries).
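
One simple way to get RMSE (again with made-up numbers) is to take the square root of MSE:

import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative annual salaries in dollars (made-up values)
actual_salary = [50_000, 65_000, 80_000]
predicted_salary = [52_000, 60_000, 85_000]

rmse = np.sqrt(mean_squared_error(actual_salary, predicted_salary))
print(rmse)  # ~4,243 -- in dollars, the same units as the salaries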

R-squared (R²):

A measure of how well the predictions fit the actual data. It typically ranges from 0 to 1, where 1 means a perfect fit (it can even go negative if the model fits worse than simply predicting the mean).

Example use case: If you’re predicting the amount of rainfall based on humidity and temperature, R² tells you how much of the variation in rainfall is explained by the model. Best used to evaluate overall model fit.
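
A small sketch with made-up rainfall values:

from sklearn.metrics import r2_score

# Illustrative daily rainfall in mm (made-up values)
actual_rainfall = [10, 20, 30, 40]
predicted_rainfall = [12, 18, 33, 37]

print(r2_score(actual_rainfall, predicted_rainfall))  # ~0.95: about 95% of the variation in rainfall is explained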

Mean Absolute Percentage Error (MAPE):

The average of the absolute percentage errors between predicted and actual values.

Example use case: If you’re forecasting sales for a store, MAPE tells you the error as a percentage, which can be more understandable to non-technical stakeholders. It’s best used when you need to express error in relative terms (e.g., “we were off by 10% on average”).
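
A minimal sketch computed directly with NumPy (the sales numbers are made up):

import numpy as np

# Illustrative weekly sales figures (made-up values)
actual_sales = np.array([200, 150, 300])
predicted_sales = np.array([180, 165, 330])

mape = np.mean(np.abs((actual_sales - predicted_sales) / actual_sales)) * 100
print(mape)  # 10.0 -> "we were off by 10% on average"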

Classification Model Evaluation Metrics:

Classification models predict categories or labels (like predicting whether an email is spam or not).

Accuracy:

The percentage of correct predictions out of all predictions.

Example use case: In a simple spam detection system where the cost of getting a few wrong is not very high, accuracy works well as a metric to check how often the system is right. Best used when the data is balanced (i.e., roughly equal numbers of classes like spam vs. not spam).
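
A quick sketch with made-up spam labels (1 = spam, 0 = not spam):

from sklearn.metrics import accuracy_score

# Illustrative labels (made-up values)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1]

print(accuracy_score(actual, predicted))  # 0.75 -> right 6 times out of 8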

Precision:

Out of all the predictions made for a positive class (e.g., "spam"), how many were correct.

Example use case: If you're detecting fraud in financial transactions, you want to avoid falsely labeling regular transactions as fraud. Precision helps you focus on the number of correct fraud detections out of all fraud predictions. Best used when the cost of false positives is high.
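
A small sketch with made-up fraud labels:

from sklearn.metrics import precision_score

# Illustrative labels: 1 = fraud, 0 = legitimate (made-up values)
actual = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 1, 0, 1, 0, 0, 0]

# Of the 3 transactions flagged as fraud, 2 really were fraud
print(precision_score(actual, predicted))  # ~0.67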

Recall (Sensitivity):

Out of all the actual positives (e.g., "spam"), how many were correctly identified by the model.

Example use case: If you're building a medical test to detect cancer, recall tells you how many actual cancer cases were correctly identified. It's best used when missing positives (false negatives) is dangerous or costly.
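
A small sketch with made-up screening labels:

from sklearn.metrics import recall_score

# Illustrative labels: 1 = cancer present, 0 = healthy (made-up values)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 0]

# Of the 4 actual cancer cases, 3 were caught by the model
print(recall_score(actual, predicted))  # 0.75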

F1 Score:

The harmonic mean of precision and recall, balancing the two.

Example use case: If you're building a fraud detection system where both false positives and false negatives are important, the F1 score gives a balanced evaluation. It’s best used when you need a trade-off between precision and recall.
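
A small sketch with made-up labels where precision and recall happen to be equal:

from sklearn.metrics import f1_score

# Illustrative labels: 1 = fraud, 0 = legitimate (made-up values)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]

# Precision and recall are both 0.75 here, so F1 is 0.75 as well
print(f1_score(actual, predicted))  # 0.75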

AUC-ROC (Area Under the Receiver Operating Characteristic Curve):

A measure of how well a model distinguishes between classes. It ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to perfect separation of the classes.

Example use case: If you're classifying whether customers will churn (leave your service), AUC-ROC helps you measure how well the model separates churners from non-churners. Best used when you want to evaluate how well the model ranks predictions.
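
A quick sketch with made-up churn labels and predicted probabilities; note that roc_auc_score takes scores or probabilities, not hard class labels:

from sklearn.metrics import roc_auc_score

# Illustrative labels: 1 = churned, 0 = stayed; scores are predicted churn probabilities (made-up values)
actual = [0, 0, 1, 1, 0, 1]
churn_probability = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print(roc_auc_score(actual, churn_probability))  # ~0.89 -> churners are usually ranked above non-churners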

Confusion Matrix:

A table showing true positives, true negatives, false positives, and false negatives.

Example use case: In an email spam filter, the confusion matrix helps you see exactly how many spam emails were correctly identified, how many legitimate emails were misclassified, etc. Best used for a detailed breakdown of all prediction results.
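
A small sketch with made-up spam labels; scikit-learn prints the matrix with actual classes as rows and predicted classes as columns:

from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = spam, 0 = legitimate (made-up values)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# [[3 1]   3 legitimate emails kept, 1 legitimate email wrongly flagged as spam
#  [1 3]]  1 spam email missed, 3 spam emails caught
print(confusion_matrix(actual, predicted))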



How to choose model evaluation metrics (flow)

Given a machine learning or deep learning problem, how do you design a series of rules/choices for choosing an appropriate metric?

Choosing an appropriate evaluation metric for a machine learning or deep learning problem depends on several factors, such as the problem type, the goals of the model, the distribution of the data, and the importance of errors. Here’s a series of rules/choices to help guide you through selecting the right metric for your problem.

1. Determine the Type of Problem (Regression or Classification)

If the target output is continuous (e.g., predicting prices, temperatures), you’re dealing with a regression problem.

If the target output is categorical (e.g., predicting spam vs. non-spam, or customer churn), you’re dealing with a classification problem.

2. For Regression Problems

Decide if large errors matter more than small errors

If large errors are especially harmful (e.g., predicting house prices or hospital waiting times where huge deviations can be problematic), use Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) to give larger errors more weight.

If all errors, big or small, should be treated equally, use Mean Absolute Error (MAE), which averages the absolute difference between predicted and actual values.

Need to express error in percentage terms?

If your stakeholders are more interested in how far off the prediction is in relative terms (e.g., in sales forecasting or stock prediction), use Mean Absolute Percentage Error (MAPE), which shows errors as percentages.

Check if you care about how much variance is explained

If your goal is to explain how much of the variance in the data the model captures (e.g., in scientific research), use R-squared (R²).

3. For Classification Problems

Check if the data is balanced

If the classes are balanced (i.e., roughly equal numbers of positive and negative cases), Accuracy is a reasonable choice.

If the data is imbalanced (i.e., one class is far more common than the other, like in fraud detection or rare disease classification), accuracy can be misleading. Focus on other metrics like Precision, Recall, F1 Score, or AUC-ROC.

Are false positives or false negatives more costly?

If false positives (predicting something is positive when it isn’t) are more costly (e.g., flagging normal transactions as fraud), prioritize Precision, which penalizes incorrect positive predictions.

If false negatives (failing to identify a positive case) are more costly (e.g., missing a cancer diagnosis), prioritize Recall, which penalizes missed positive cases.

If you need a balance between false positives and false negatives (e.g., fraud detection or email spam classification), use the F1 Score, which balances Precision and Recall.

Do you care about ranking predictions?

If you care about the model's ability to rank predictions or if you're working with imbalanced data, use AUC-ROC (Area Under the Receiver Operating Characteristic Curve). It’s particularly useful when you want to evaluate the trade-off between true positive and false positive rates.

4. For Multi-class Classification Problems

If you're dealing with a problem where there are more than two classes (e.g., predicting different species of animals or types of diseases), consider Accuracy if the classes are balanced.

If the classes are imbalanced, use Macro-Averaged F1 Score or Weighted F1 Score to take the imbalance into account.
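
A quick sketch with made-up three-class labels, using scikit-learn's average parameter to switch between the two flavors:

from sklearn.metrics import f1_score

# Illustrative 3-class labels, e.g. 0 = cat, 1 = dog, 2 = bird (made-up values)
actual = [0, 0, 0, 0, 1, 1, 2, 2]
predicted = [0, 0, 0, 1, 1, 2, 2, 2]

print(f1_score(actual, predicted, average="macro"))     # ~0.72: every class counts equally
print(f1_score(actual, predicted, average="weighted"))  # ~0.75: classes weighted by how often they occur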

Use a Confusion Matrix to get a detailed breakdown of how your model performs on each class, especially for imbalanced data.

5. For Time-Series Forecasting Problems

For time-series problems (e.g., predicting stock prices or sales), where the sequence of data matters, use Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), or Mean Absolute Percentage Error (MAPE) to capture the size of errors.

If you want to ensure the model predicts trend direction correctly (i.e., increasing vs. decreasing), consider using Directional Accuracy in combination with other metrics.
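
Directional accuracy is not a standard library metric; one simple sketch (with made-up sales figures, taking “direction” to mean the sign of the step-to-step change) is:

import numpy as np

# Illustrative consecutive sales figures (made-up values)
actual = np.array([100, 105, 103, 110, 108])
predicted = np.array([101, 104, 106, 109, 107])

# Compare the direction (up/down) of consecutive changes
actual_direction = np.sign(np.diff(actual))
predicted_direction = np.sign(np.diff(predicted))

directional_accuracy = np.mean(actual_direction == predicted_direction)
print(directional_accuracy)  # 0.75 -> the trend direction was right in 3 of 4 steps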

6. Consider Stakeholders’ Needs

If the stakeholders (business users, executives, clients) care more about easily understandable results, choose metrics like MAE or MAPE for regression, or Accuracy for classification, because they are intuitive.

If stakeholders are more technical and want insights into model performance nuances, use Precision, Recall, F1 Score, or AUC-ROC to highlight the trade-offs your model is making.

7. For Deep Learning Models

For deep learning models in tasks like image or text classification, metrics like Accuracy, Precision, Recall, and F1 Score are common, but make sure to use AUC-ROC if the data is imbalanced.

For deep learning models in regression tasks (e.g., predicting continuous values from images), use MSE or RMSE to capture model performance, especially when large errors are problematic.

Flowchart Summary of Choices:

Is your problem a regression or classification task?

Regression → MAE, MSE, RMSE, MAPE, R²

Classification → Accuracy, Precision, Recall, F1 Score, AUC-ROC

For regression problems:

Large errors more important? → MSE or RMSE

All errors equally important? → MAE

Error in percentages? → MAPE

Explaining variance? → R²

For classification problems:

Balanced data? → Accuracy

Imbalanced data? → Precision, Recall, F1 Score, AUC-ROC

False positives costly? → Precision

False negatives costly? → Recall

Need balance? → F1 Score

Ranking predictions important? → AUC-ROC

Multi-class problems:

Balanced data? → Accuracy

Imbalanced data? → Macro-Averaged F1 Score


