How to detect drift with Evidently and MLflow

Data Drift

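Data drift refers to a change in the statistical properties of data over time. In the context of machine learning, it usually means that the distribution of the input features seen in production moves away from the distribution the model was trained on. It is related to, but distinct from, concept drift, where the relationship between the inputs and the target variable that the model is trying to predict changes over time.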

This change in data patterns can lead to a degradation of model performance because the assumptions that the model learned during training no longer hold. For instance, a model trained to predict customer churn based on historical data may start to perform poorly if the behavior of customers changes significantly due to new market conditions or changes in the company's policies.

There are several types of data drift:

  1. Sudden Drift: This is when the data distribution changes abruptly. This could be due to a change in data collection, a change in policy, or a sudden shift in user behavior.
  2. Incremental Drift: This is a slow and gradual change in data distribution over time. It can be challenging to detect because it happens slowly.
  3. Seasonal Drift: This type of drift is predictable and cyclical. It's often found in data related to fields like retail, finance, and weather where there are regular and predictable changes.
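
The three patterns above can be imitated on synthetic data. The sketch below is purely illustrative: the series, the shift size of 3, and the helper window_means are assumptions made for this example and are not part of the dataset used later. It prints the mean of each quarter of the series, which is one simple way to see how each type of drift behaves.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1000
t = np.arange(n)
noise = rng.normal(loc=0.0, scale=1.0, size=n)

# Sudden drift: the mean jumps from 0 to 3 halfway through the series.
sudden = noise + np.where(t < n // 2, 0.0, 3.0)

# Incremental drift: the mean rises linearly from 0 to 3 over the series.
incremental = noise + 3.0 * t / (n - 1)

# Seasonal drift: the mean follows a cycle that repeats every 250 steps.
seasonal = noise + 3.0 * np.sin(2 * np.pi * t / 250)


def window_means(series, window=250):
    """Mean of each consecutive, non-overlapping window of the series."""
    return series.reshape(-1, window).mean(axis=1)


for name, series in [("sudden", sudden),
                     ("incremental", incremental),
                     ("seasonal", seasonal)]:
    print(name, np.round(window_means(series), 2))
```

Note that the seasonal series looks stable when the window matches the cycle length, which is one reason cyclical changes are usually compared against the same point in an earlier cycle.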

Detecting data drift can be challenging because it requires constant monitoring of the model's input and output data. Some indicators of data drift include a decrease in model performance, an increase in the number of errors, or a change in the distribution of predictions.
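
A change in the distribution of predictions can be quantified with a simple score. The sketch below uses the Population Stability Index (PSI), a generic technique chosen here for illustration and not something taken from this article's pipeline. The function name, the number of bins, and the synthetic scores are all assumptions.

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI between two 1-D samples, with bins defined by reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Inner bin edges are reference quantiles, so each bin holds
    # roughly 1/bins of the reference sample.
    inner_edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current), minlength=bins)
    # A small floor avoids log(0) and division by zero for empty bins.
    ref_share = np.clip(ref_counts / reference.size, 1e-6, None)
    cur_share = np.clip(cur_counts / current.size, 1e-6, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


# Synthetic prediction scores (made up for this example).
rng = np.random.default_rng(seed=0)
reference_scores = rng.normal(0.0, 1.0, size=5000)
stable_scores = rng.normal(0.0, 1.0, size=5000)
shifted_scores = rng.normal(1.0, 1.0, size=5000)

print("stable :", round(population_stability_index(reference_scores, stable_scores), 4))
print("shifted:", round(population_stability_index(reference_scores, shifted_scores), 4))
```

An often-quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions and should be tuned to the use case.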

MLflow

MLflow is an open-source platform that helps manage the end-to-end machine learning lifecycle. It includes tools for experiment tracking, model packaging, reproducibility, deployment, and a central model registry. MLflow is designed to work with any machine learning library and algorithm, simplifying the management of ML projects. You can find more about MLflow on their official website.

Evidently

To keep data drift in check, it becomes necessary to monitor model performance and take corrective measures in time. Evidently is an open-source Python library that helps with most of this.

Evidently works with tabular and text data and supports the whole model lifecycle with its reports, tests, and monitoring.

For data drift detection, Evidently provides a set of statistical tests and default thresholds that depend on the type of feature (numerical or categorical). It also allows users to define custom drift detection methods and thresholds. It produces reports that give feature-level as well as dataset-level drift insights. Reports can be viewed as HTML or consumed further as JSON. It can also be integrated with MLOps tools such as Airflow, MLflow, and Metaflow.

In this blog, we perform data drift analysis on a sample dataset and integrate the Evidently output with MLflow in a custom way.
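
The idea of choosing a drift method by feature type and then rolling the feature-level results up to a dataset-level verdict can be sketched in plain NumPy. This is a minimal sketch and not Evidently's implementation: the function names, the 0.1 score threshold, the 0.5 share of drifted features, and the synthetic data are all assumptions made for illustration. Numerical features use a Wasserstein distance scaled by the reference standard deviation, and categorical features use the Jensen-Shannon distance.

```python
import numpy as np


def normed_wasserstein(reference, current, n_quantiles=101):
    """Approximate 1-D Wasserstein distance, scaled by the reference std."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    q = np.linspace(0.0, 1.0, n_quantiles)
    distance = np.mean(np.abs(np.quantile(reference, q) - np.quantile(current, q)))
    return float(distance / reference.std())


def jensen_shannon_distance(reference, current):
    """Jensen-Shannon distance between the category frequencies of two samples."""
    reference = np.asarray(reference)
    current = np.asarray(current)
    categories = np.union1d(reference, current)
    p = np.array([np.mean(reference == c) for c in categories])
    q = np.array([np.mean(current == c) for c in categories])
    m = (p + q) / 2.0

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))  # guard against rounding below 0


def drift_summary(reference, current, numerical, categorical,
                  threshold=0.1, drift_share=0.5):
    """Feature-level scores plus a dataset-level flag (assumed thresholds)."""
    scores = {name: normed_wasserstein(reference[name], current[name])
              for name in numerical}
    scores.update({name: jensen_shannon_distance(reference[name], current[name])
                   for name in categorical})
    drifted = {name: score > threshold for name, score in scores.items()}
    share = sum(drifted.values()) / len(drifted)
    return {"scores": scores, "drifted": drifted,
            "share_drifted": share, "dataset_drift": share >= drift_share}


# Synthetic data (made up): ram and dual_sim are shifted, clock_speed is not.
rng = np.random.default_rng(seed=42)
reference = {"ram": rng.normal(2000, 500, 2000),
             "clock_speed": rng.normal(1.5, 0.5, 2000),
             "dual_sim": rng.binomial(1, 0.5, 2000)}
current = {"ram": rng.normal(2600, 500, 2000),
           "clock_speed": rng.normal(1.5, 0.5, 2000),
           "dual_sim": rng.binomial(1, 0.9, 2000)}

summary = drift_summary(reference, current,
                        numerical=["ram", "clock_speed"],
                        categorical=["dual_sim"])
print(summary["drifted"], summary["dataset_drift"])
```

Keeping the per-feature scores alongside the dataset-level flag mirrors the two levels of insight described above and makes it easy to see which features drive the overall verdict.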

Installation and Setup

For this purpose, we need to install numpy, pandas, evidently, and mlflow, and import them along with datetime from the Python standard library.

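import numpy as np
import pandas as pd
# Dashboard and DataDriftTab belong to Evidently's older dashboard API;
# newer releases replace them with Report and metric presets,
# so an older Evidently version is assumed here.
from evidently.pipeline.column_mapping import ColumnMapping
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab
import mlflow
from mlflow.tracking import MlflowClient
from datetime import datetime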

Dataset

For this experiment, let’s pick the mobile price prediction dataset with a limited set of features. The features battery_power, clock_speed, int_memory, mobile_wt, n_cores, and ram are numerical, whereas dual_sim and four_g are categorical. The dataset is divided into two equal halves: reference data (df_ref) and current data (df_curr).

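# Load the dataset (file name reconstructed from the garbled listing)
df = pd.read_csv('mobile_price.csv')
# iloc slices are end-exclusive, so row 500 does not land in both halves
df_ref = df.iloc[:500, :]
df_curr = df.iloc[500:, :].reset_index(drop=True)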

Drift is then introduced into the numerical features battery_power and ram and the categorical feature dual_sim of the current dataset.
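
The article does not show how the drift was introduced, so the following is only one possible way to do it. The sketch runs on a synthetic stand-in for df_curr. The column values, the 20% scaling, the offset of 800, and the 60% flip rate are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Synthetic stand-in for the current half of the data (values are made up).
df_curr = pd.DataFrame({
    "battery_power": rng.integers(500, 2000, size=500),
    "ram": rng.integers(256, 4000, size=500),
    "dual_sim": rng.integers(0, 2, size=500),
})
original = df_curr.copy()

# Numerical drift: scale battery_power up by 20% and shift ram by a fixed offset.
df_curr["battery_power"] = (df_curr["battery_power"] * 1.2).round().astype(int)
df_curr["ram"] = df_curr["ram"] + 800

# Categorical drift: flip about 60% of the single-SIM rows to dual-SIM.
single_sim = df_curr.index[df_curr["dual_sim"] == 0].to_numpy()
flipped = rng.choice(single_sim, size=int(0.6 * len(single_sim)), replace=False)
df_curr.loc[flipped, "dual_sim"] = 1

# Compare column means before and after the change.
print(original.mean().round(2))
print(df_curr.mean().round(2))
```

Keeping an untouched copy of the data makes it easy to confirm that only the intended columns moved before running the drift report.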

Code

The dataset and date variables are defined for use in the naming of the drift reports and the MLflow experiment and runs.

# Dataset name and timestamp used when naming reports, experiments, and runs
dataset = 'mobile_price'
date = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

Drift analysis can be done only on the features that are common to the reference and current datasets. Column mapping is also necessary so that suitable statistical tests are chosen to calculate drift. Columns are mapped as numerical_features and categorical_features.

We use Dashboard with DataDriftTab to calculate covariate drift (i.e., changes in the distribution of the independent features). It requires the reference data, the current data, and the column mapping.

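# Features present in both the reference and the current dataset
common_features = [feature for feature in list(df_ref.columns) if feature in list(df_curr.columns)]
# Tell Evidently which columns are categorical and which are numerical
column_mapping = ColumnMapping()
column_mapping.categorical_features = ['dual_sim', 'four_g']
column_mapping.numerical_features = ['battery_power', 'clock_speed', 'int_memory', 'mobile_wt', 'n_cores', 'ram']
# Build the data drift dashboard and run the drift calculations
covariate_drift_report = Dashboard(tabs=[DataDriftTab()])
covariate_drift_report.calculate(df_ref, df_curr, column_mapping=column_mapping)
# Take the results of the first (and only) analyzer behind the dashboard
covariate_output = list(covariate_drift_report.analyzers_results.values())[0]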
