How to choose model evaluation metrics

For my students, I put together the guide below, with the help of ChatGPT, on how to choose model evaluation metrics. Each metric is useful depending on what you’re trying to achieve with the model. For instance, recall is more important in medical diagnosis, while precision is more important in fraud detection.

Regression Model Evaluation Metrics:

Regression models predict continuous values (like predicting house prices).

Mean Absolute Error (MAE):

The average of the absolute differences between the predicted and actual values.

Example use case: If you're predicting housing prices, MAE gives a straightforward number showing the average price difference between the predicted price and the actual price. It's best used when you want to understand the overall error in simple dollar terms.
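
As a quick sketch (the prices below are made-up, illustrative numbers), MAE can be computed with scikit-learn:

from sklearn.metrics import mean_absolute_error

# Illustrative house prices in dollars (made-up values)
actual_prices = [300_000, 250_000, 400_000]
predicted_prices = [310_000, 240_000, 420_000]

mae = mean_absolute_error(actual_prices, predicted_prices)
print(mae)  # ~13,333 -> off by about $13,333 on average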

Mean Squared Error (MSE):

The average of the squared differences between the predicted and actual values. Squaring gives more weight to larger errors.

Example use case: If you're predicting electricity consumption, where large prediction errors (underestimating or overestimating) can be costly, MSE helps highlight those large errors. It's best used when large errors are more problematic.
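
A minimal sketch with made-up consumption figures, showing how the two large misses dominate the score:

from sklearn.metrics import mean_squared_error

# Illustrative daily electricity consumption in kWh (made-up values)
actual_kwh = [30, 28, 35, 40]
predicted_kwh = [32, 27, 30, 45]

mse = mean_squared_error(actual_kwh, predicted_kwh)
print(mse)  # 13.75 -- the two 5 kWh misses contribute 50 of the 55 total squared error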

Root Mean Squared Error (RMSE):

The square root of MSE. It brings the unit back to the original scale of the prediction.

Example use case: Similar to MSE but easier to interpret because it's in the same units as the predicted value. Best used for comparing models where you want to understand how far off, on average, your predictions are in practical terms (e.g., predicting salaries).
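
One simple way to get RMSE (again with made-up numbers) is to take the square root of MSE:

import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative annual salaries in dollars (made-up values)
actual_salary = [50_000, 65_000, 80_000]
predicted_salary = [52_000, 60_000, 85_000]

rmse = np.sqrt(mean_squared_error(actual_salary, predicted_salary))
print(rmse)  # ~4,243 -- in dollars, the same units as the salaries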

R-squared (R²):

A measure of how well the predictions fit the actual data. It typically ranges from 0 to 1, where 1 means a perfect fit (it can even go negative if the model fits worse than simply predicting the mean).

Example use case: If you’re predicting the amount of rainfall based on humidity and temperature, R² tells you how much of the variation in rainfall is explained by the model. Best used to evaluate overall model fit.
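
A small sketch with made-up rainfall values:

from sklearn.metrics import r2_score

# Illustrative daily rainfall in mm (made-up values)
actual_rainfall = [10, 20, 30, 40]
predicted_rainfall = [12, 18, 33, 37]

print(r2_score(actual_rainfall, predicted_rainfall))  # ~0.95: about 95% of the variation in rainfall is explained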

Mean Absolute Percentage Error (MAPE):

The average of the absolute percentage errors between predicted and actual values.

Example use case: If you’re forecasting sales for a store, MAPE tells you the error as a percentage, which can be more understandable to non-technical stakeholders. It’s best used when you need to express error in relative terms (e.g., “we were off by 10% on average”).
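
A minimal sketch computed directly with NumPy (the sales numbers are made up):

import numpy as np

# Illustrative weekly sales figures (made-up values)
actual_sales = np.array([200, 150, 300])
predicted_sales = np.array([180, 165, 330])

mape = np.mean(np.abs((actual_sales - predicted_sales) / actual_sales)) * 100
print(mape)  # 10.0 -> "we were off by 10% on average"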

Classification Model Evaluation Metrics:

Classification models predict categories or labels (like predicting whether an email is spam or not).

Accuracy:

The percentage of correct predictions out of all predictions.

Example use case: In a simple spam detection system where the cost of getting a few wrong is not very high, accuracy works well as a metric to check how often the system is right. Best used when the data is balanced (i.e., roughly equal numbers of classes like spam vs. not spam).
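
A quick sketch with made-up spam labels (1 = spam, 0 = not spam):

from sklearn.metrics import accuracy_score

# Illustrative labels (made-up values)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1]

print(accuracy_score(actual, predicted))  # 0.75 -> right 6 times out of 8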

Precision:

Out of all the predictions made for a positive class (e.g., "spam"), how many were correct.

Example use case: If you're detecting fraud in financial transactions, you want to avoid falsely labeling regular transactions as fraud. Precision helps you focus on the number of correct fraud detections out of all fraud predictions. Best used when the cost of false positives is high.
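
A small sketch with made-up fraud labels:

from sklearn.metrics import precision_score

# Illustrative labels: 1 = fraud, 0 = legitimate (made-up values)
actual = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 1, 0, 1, 0, 0, 0]

# Of the 3 transactions flagged as fraud, 2 really were fraud
print(precision_score(actual, predicted))  # ~0.67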

Recall (Sensitivity):

Out of all the actual positives (e.g., "spam"), how many were correctly identified by the model.

Example use case: If you're building a medical test to detect cancer, recall tells you how many actual cancer cases were correctly identified. It's best used when missing positives (false negatives) is dangerous or costly.
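
A small sketch with made-up screening labels:

from sklearn.metrics import recall_score

# Illustrative labels: 1 = cancer present, 0 = healthy (made-up values)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 0]

# Of the 4 actual cancer cases, 3 were caught by the model
print(recall_score(actual, predicted))  # 0.75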

F1 Score:

The harmonic mean of precision and recall, balancing the two.

Example use case: If you're building a fraud detection system where both false positives and false negatives are important, the F1 score gives a balanced evaluation. It’s best used when you need a trade-off between precision and recall.
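
A small sketch with made-up labels where precision and recall happen to be equal:

from sklearn.metrics import f1_score

# Illustrative labels: 1 = fraud, 0 = legitimate (made-up values)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]

# Precision and recall are both 0.75 here, so F1 is 0.75 as well
print(f1_score(actual, predicted))  # 0.75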

AUC-ROC (Area Under the Receiver Operating Characteristic Curve):

A measure of how well a model distinguishes between classes. It ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to perfect separation of the classes.

Example use case: If you're classifying whether customers will churn (leave your service), AUC-ROC helps you measure how well the model separates churners from non-churners. Best used when you want to evaluate how well the model ranks predictions.
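
A quick sketch with made-up churn labels and predicted probabilities; note that roc_auc_score takes scores or probabilities, not hard class labels:

from sklearn.metrics import roc_auc_score

# Illustrative labels: 1 = churned, 0 = stayed; scores are predicted churn probabilities (made-up values)
actual = [0, 0, 1, 1, 0, 1]
churn_probability = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print(roc_auc_score(actual, churn_probability))  # ~0.89 -> churners are usually ranked above non-churners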

Confusion Matrix:

A table showing true positives, true negatives, false positives, and false negatives.

Example use case: In an email spam filter, the confusion matrix helps you see exactly how many spam emails were correctly identified, how many legitimate emails were misclassified, etc. Best used for a detailed breakdown of all prediction results.
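
A small sketch with made-up spam labels; scikit-learn prints the matrix with actual classes as rows and predicted classes as columns:

from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = spam, 0 = legitimate (made-up values)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# [[3 1]   3 legitimate emails kept, 1 legitimate email wrongly flagged as spam
#  [1 3]]  1 spam email missed, 3 spam emails caught
print(confusion_matrix(actual, predicted))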



How to choose model evaluation metrics (flow)

Given a machine learning or deep learning problem, how do you design a series of rules/choices for choosing an appropriate metric?

Choosing an appropriate evaluation metric for a machine learning or deep learning problem depends on several factors, such as the problem type, the goals of the model, the distribution of the data, and the importance of errors. Here’s a series of rules/choices to help guide you through selecting the right metric for your problem.

1. Determine the Type of Problem (Regression or Classification)

If the target output is continuous (e.g., predicting prices, temperatures), you’re dealing with a regression problem.

If the target output is categorical (e.g., predicting spam vs. non-spam, or customer churn), you’re dealing with a classification problem.

2. For Regression Problems

Decide if large errors matter more than small errors

If large errors are especially harmful (e.g., predicting house prices or hospital waiting times where huge deviations can be problematic), use Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) to give larger errors more weight.

If all errors, big or small, should be treated equally, use Mean Absolute Error (MAE), which averages the absolute difference between predicted and actual values.

Need to express error in percentage terms?

If your stakeholders are more interested in how far off the prediction is in relative terms (e.g., in sales forecasting or stock prediction), use Mean Absolute Percentage Error (MAPE), which shows errors as percentages.

Check if you care about how much variance is explained

If your goal is to explain how much of the variance in the data the model captures (e.g., in scientific research), use R-squared (R²).

3. For Classification Problems

Check if the data is balanced

If the classes are balanced (i.e., roughly equal numbers of positive and negative cases), Accuracy is a reasonable choice.

If the data is imbalanced (i.e., one class is far more common than the other, like in fraud detection or rare disease classification), accuracy can be misleading. Focus on other metrics like Precision, Recall, F1 Score, or AUC-ROC.

Are false positives or false negatives more costly?

If false positives (predicting something is positive when it isn’t) are more costly (e.g., flagging normal transactions as fraud), prioritize Precision, which penalizes incorrect positive predictions.

If false negatives (failing to identify a positive case) are more costly (e.g., missing a cancer diagnosis), prioritize Recall, which penalizes missed positive cases.

If you need a balance between false positives and false negatives (e.g., fraud detection or email spam classification), use the F1 Score, which balances Precision and Recall.

Do you care about ranking predictions?

If you care about the model's ability to rank predictions or if you're working with imbalanced data, use AUC-ROC (Area Under the Receiver Operating Characteristic Curve). It’s particularly useful when you want to evaluate the trade-off between true positive and false positive rates.

4. For Multi-class Classification Problems

If you're dealing with a problem where there are more than two classes (e.g., predicting different species of animals or types of diseases), consider Accuracy if the classes are balanced.

If the classes are imbalanced, use Macro-Averaged F1 Score or Weighted F1 Score to take the imbalance into account.
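
A quick sketch with made-up three-class labels, using scikit-learn's average parameter to switch between the two flavors:

from sklearn.metrics import f1_score

# Illustrative 3-class labels, e.g. 0 = cat, 1 = dog, 2 = bird (made-up values)
actual = [0, 0, 0, 0, 1, 1, 2, 2]
predicted = [0, 0, 0, 1, 1, 2, 2, 2]

print(f1_score(actual, predicted, average="macro"))     # ~0.72: every class counts equally
print(f1_score(actual, predicted, average="weighted"))  # ~0.75: classes weighted by how often they occur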

Use a Confusion Matrix to get a detailed breakdown of how your model performs on each class, especially for imbalanced data.

5. For Time-Series Forecasting Problems

For time-series problems (e.g., predicting stock prices or sales), where the sequence of data matters, use Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), or Mean Absolute Percentage Error (MAPE) to capture the size of errors.

If you want to ensure the model predicts trend direction correctly (i.e., increasing vs. decreasing), consider using Directional Accuracy in combination with other metrics.
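
Directional accuracy is not a standard library metric; one simple sketch (with made-up sales figures, taking “direction” to mean the sign of the step-to-step change) is:

import numpy as np

# Illustrative consecutive sales figures (made-up values)
actual = np.array([100, 105, 103, 110, 108])
predicted = np.array([101, 104, 106, 109, 107])

# Compare the direction (up/down) of consecutive changes
actual_direction = np.sign(np.diff(actual))
predicted_direction = np.sign(np.diff(predicted))

directional_accuracy = np.mean(actual_direction == predicted_direction)
print(directional_accuracy)  # 0.75 -> the trend direction was right in 3 of 4 steps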

6. Consider Stakeholders’ Needs

If the stakeholders (business users, executives, clients) care more about easily understandable results, choose metrics like MAE or MAPE for regression, or Accuracy for classification, because they are intuitive.

If stakeholders are more technical and want insights into model performance nuances, use Precision, Recall, F1 Score, or AUC-ROC to highlight the trade-offs your model is making.

7. For Deep Learning Models

For deep learning models in tasks like image or text classification, metrics like Accuracy, Precision, Recall, and F1 Score are common, but make sure to use AUC-ROC if the data is imbalanced.

For deep learning models in regression tasks (e.g., predicting continuous values from images), use MSE or RMSE to capture model performance, especially when large errors are problematic.

Flowchart Summary of Choices:

Is your problem a regression or classification task?

Regression → MAE, MSE, RMSE, MAPE, R²

Classification → Accuracy, Precision, Recall, F1 Score, AUC-ROC

For regression problems:

Large errors more important? → MSE or RMSE

All errors equally important? → MAE

Error in percentages? → MAPE

Explaining variance? → R²

For classification problems:

Balanced data? → Accuracy

Imbalanced data? → Precision, Recall, F1 Score, AUC-ROC

False positives costly? → Precision

False negatives costly? → Recall

Need balance? → F1 Score

Ranking predictions important? → AUC-ROC

Multi-class problems:

Balanced data? → Accuracy

Imbalanced data? → Macro-Averaged F1 Score


