ValidMind for model validation 3 — Developing a potential challenger model

Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger model and then pass your model and its predictions to ValidMind.

A challenger model is an alternate model that attempts to outperform the champion model, ensuring that the best performing fit-for-purpose model is always considered for deployment. Challenger models also help avoid over-reliance on a single model, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.

Prerequisites

In order to develop potential challenger models with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be familiar to you, as we performed the same actions in the previous notebook, 2 — Start the model validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. On the left sidebar that appears for your model, select Getting Started and select Validation from the DOCUMENT drop-down menu.
  2. Click Copy snippet to clipboard.
  3. Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:
# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="validation-report",
)
Note: you may need to restart the kernel to use updated packages.
2026-04-03 03:05:58,265 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load in the sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we'll independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}
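Before preprocessing, it's worth confirming how imbalanced the target actually is, since that motivates the rebalancing in the next step. A minimal sketch of the check — the DataFrame below is a small synthetic stand-in for `raw_df`:

```python
import pandas as pd

# Synthetic stand-in for `raw_df` (the real dataset has ~10k rows and a
# minority "Exited" class)
raw_df = pd.DataFrame({"Exited": [0] * 8 + [1] * 2})

# Absolute and relative class frequencies of the target
counts = raw_df["Exited"].value_counts()
proportions = raw_df["Exited"].value_counts(normalize=True)
print(counts)
print(proportions)
```

A large gap between the two proportions is what the undersampling step below corrects.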

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
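The undersampling above can be sanity-checked by confirming both classes now appear in equal numbers. A self-contained sketch of the same steps on toy stand-in data:

```python
import pandas as pd

# Toy stand-in for `raw_df` with an imbalanced target
raw_df = pd.DataFrame({"Exited": [0] * 8 + [1] * 2, "x": range(10)})
raw_copy_df = raw_df.sample(frac=1, random_state=0)

# Undersample the majority class down to the minority-class count
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(
    n=exited_df.shape[0], random_state=0
)
balanced_df = pd.concat([exited_df, not_exited_df])

# Both classes should now have identical counts
print(balanced_df["Exited"].value_counts())
```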

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before we can run tests, we'll need to initialize a ValidMind dataset object with the init_dataset function:

# Register the balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test identifies pairs of features in the dataset that exhibit strong linear relationships, with the aim of detecting potential feature redundancy or multicollinearity. The results table lists the top ten feature pairs ranked by the absolute value of their Pearson correlation coefficients, along with a Pass or Fail status based on a threshold of 0.3. Only one feature pair exceeds the threshold, while the remaining pairs show lower correlation values and pass the test criteria.

Key insights:

  • One feature pair exceeds correlation threshold: The pair (Age, Exited) has a correlation coefficient of 0.3536, surpassing the 0.3 threshold and resulting in a Fail status.
  • All other feature pairs pass: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.2033 to 0.0355, all below the threshold and marked as Pass.
  • Predominantly weak linear relationships: Most feature pairs display weak linear associations, with coefficients clustered well below the threshold.

The test results indicate that the dataset contains minimal evidence of strong linear relationships among most feature pairs, with only the (Age, Exited) pair exhibiting moderate correlation above the defined threshold. The overall correlation structure suggests low risk of widespread multicollinearity or feature redundancy based on linear associations.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3536 Fail
(IsActiveMember, Exited) -0.2033 Pass
(Balance, NumOfProducts) -0.1554 Pass
(Balance, Exited) 0.1431 Pass
(NumOfProducts, Exited) -0.0607 Pass
(Tenure, IsActiveMember) -0.0598 Pass
(NumOfProducts, IsActiveMember) 0.0526 Pass
(Age, NumOfProducts) -0.0414 Pass
(Age, EstimatedSalary) -0.0402 Pass
(Balance, HasCrCard) -0.0355 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3536 Fail
1 (IsActiveMember, Exited) -0.2033 Pass
2 (Balance, NumOfProducts) -0.1554 Pass
3 (Balance, Exited) 0.1431 Pass
4 (NumOfProducts, Exited) -0.0607 Pass
5 (Tenure, IsActiveMember) -0.0598 Pass
6 (NumOfProducts, IsActiveMember) 0.0526 Pass
7 (Age, NumOfProducts) -0.0414 Pass
8 (Age, EstimatedSalary) -0.0402 Pass
9 (Balance, HasCrCard) -0.0355 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

We can then re-initialize the dataset with a different input_id and with the highly correlated features removed, and re-run the test to confirm:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten absolute Pearson correlation coefficients, along with the corresponding feature pairs and their Pass/Fail status based on a threshold of 0.3. All reported coefficients are below the threshold, and each feature pair is marked as Pass.

Key insights:

  • No feature pairs exceed correlation threshold: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest magnitude observed at 0.2033 between IsActiveMember and Exited.
  • Low to moderate linear relationships: The strongest correlations, both positive and negative, remain modest in magnitude, with coefficients ranging from -0.2033 to 0.0328 across the top ten feature pairs.
  • Consistent Pass status across all pairs: Every evaluated feature pair is marked as Pass, indicating no detected risk of linear redundancy or multicollinearity among the top correlated features.

The results indicate that the dataset does not exhibit high linear correlations among the evaluated feature pairs. All observed relationships fall well below the specified threshold, suggesting minimal risk of feature redundancy or multicollinearity based on linear association. The correlation structure supports the interpretability and stability of the model inputs.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.2033 Pass
(Balance, NumOfProducts) -0.1554 Pass
(Balance, Exited) 0.1431 Pass
(NumOfProducts, Exited) -0.0607 Pass
(Tenure, IsActiveMember) -0.0598 Pass
(NumOfProducts, IsActiveMember) 0.0526 Pass
(Balance, HasCrCard) -0.0355 Pass
(HasCrCard, IsActiveMember) -0.0345 Pass
(Tenure, EstimatedSalary) 0.0328 Pass
(CreditScore, EstimatedSalary) -0.0312 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and the highly correlated features removed, let's now split our dataset into training and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
7908 680 10 0.00 2 1 0 187008.45 0 False False True
3389 452 7 153663.27 1 1 0 111868.23 0 True False True
2524 552 5 0.00 2 1 1 1351.41 0 False False False
3401 753 8 0.00 3 1 0 90747.94 1 False False True
3421 476 8 111905.43 1 0 1 197221.81 1 True False False
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
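Note that the split above has no fixed seed, so it changes on every run. If reproducibility matters for your validation work, a `random_state` (and optionally `stratify`, to preserve the class ratio in both partitions) can be passed — a minimal sketch on a synthetic stand-in DataFrame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `balanced_raw_no_age_df`
df = pd.DataFrame({"x": range(100), "Exited": [0, 1] * 50})

# `random_state` makes the split reproducible; `stratify` keeps the class
# ratio identical across train and test (optional here, since the data is
# already balanced)
train_df, test_df = train_test_split(
    df, test_size=0.20, random_state=42, stratify=df["Exited"]
)

print(train_df.shape, test_df.shape)  # (80, 2) (20, 2)
print(train_df["Exited"].mean(), test_df["Exited"].mean())  # 0.5 0.5
```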
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Import the champion model

With our raw dataset assessed and preprocessed, let's import the champion model submitted by the model development team as a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(

Training a potential challenger model

We're curious how an alternate model compares to our champion model, so let's train a challenger model as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, model risk is not assessed on any single factor in isolation, but rather by weighing trade-offs in predictive performance, ease of interpretability, and overall alignment with business objectives.

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models generally have higher accuracy because they capture complex, non-linear relationships, but as a result they lack transparency in their predictions.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)
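Although random forests are less transparent than logistic regression, scikit-learn does expose impurity-based feature importances that give a coarse view into the ensemble. A minimal sketch on synthetic stand-in data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances sum to 1 and rank features by how much they
# reduce impurity across the ensemble's trees
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```

Keep in mind that impurity-based importances can be biased toward high-cardinality features; they are a rough diagnostic, not a full explanation.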

Initialize the model objects

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data.

You simply initialize this model object with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model, and the binary class predictions obtained after applying a cutoff threshold.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-04-03 03:06:10,565 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-03 03:06:10,566 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-03 03:06:10,567 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-03 03:06:10,570 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-03 03:06:10,572 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-03 03:06:10,574 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-03 03:06:10,575 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-03 03:06:10,576 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-03 03:06:10,579 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-03 03:06:10,601 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-03 03:06:10,602 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-03 03:06:10,624 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-03 03:06:10,626 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-03 03:06:10,639 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-03 03:06:10,640 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-03 03:06:10,652 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion model, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in mpt.

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, giving a final, more unbiased checkpoint.

for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

The Classifier Performance test evaluates the predictive effectiveness of the classification model by reporting precision, recall, F1-score, accuracy, and ROC AUC metrics. The results are presented in two tables: one summarizing class-wise and aggregate precision, recall, and F1-scores, and another reporting overall accuracy and ROC AUC. These metrics provide a quantitative assessment of the model's ability to distinguish between classes and its general classification performance.

Key insights:

  • Balanced class-wise performance: Precision, recall, and F1-scores are similar across both classes, with precision ranging from 0.6284 to 0.6772 and recall from 0.635 to 0.671, indicating no substantial imbalance in predictive performance between classes.
  • Consistent aggregate metrics: Weighted and macro averages for precision, recall, and F1-score are closely aligned (all approximately 0.65), reflecting uniformity in model performance across classes.
  • Moderate overall accuracy: The model achieves an accuracy of 0.6522, indicating that approximately 65% of predictions match the true class labels.
  • ROC AUC indicates moderate separability: The ROC AUC score of 0.6903 suggests the model has moderate ability to distinguish between the two classes.

The results indicate that the model demonstrates consistent and balanced predictive performance across both classes, with moderate accuracy and ROC AUC values. The close alignment of class-wise and aggregate metrics suggests uniform model behavior without significant class bias. The ROC AUC and accuracy values reflect moderate discriminative capability, providing a clear quantitative profile of current model performance.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6284 0.6710 0.6490
1 0.6772 0.6350 0.6554
Weighted Average 0.6538 0.6522 0.6523
Macro Average 0.6528 0.6530 0.6522

Accuracy and ROC AUC

Metric Value
Accuracy 0.6522
ROC AUC 0.6903
2026-04-03 03:06:18,087 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

The Confusion Matrix test evaluates the classification performance of the model by comparing predicted and actual class labels, providing a breakdown of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). The resulting matrix visually displays the distribution of correct and incorrect predictions, enabling assessment of the model's ability to distinguish between classes. The matrix presents the following counts: 214 True Positives, 208 True Negatives, 102 False Positives, and 123 False Negatives.

Key insights:

  • Higher True Positive and True Negative counts: The model correctly classified 214 positive cases and 208 negative cases, indicating a substantial proportion of accurate predictions for both classes.
  • Notable False Negative and False Positive rates: There are 123 False Negatives and 102 False Positives, reflecting a non-trivial rate of misclassification for both classes.
  • Balanced error distribution: The counts of False Positives and False Negatives are of similar magnitude, suggesting that misclassification is not heavily skewed toward one class.

The confusion matrix reveals that the model demonstrates a balanced ability to correctly identify both positive and negative cases, with True Positives and True Negatives outnumbering their respective error counterparts. However, the presence of over 100 instances each of False Positives and False Negatives indicates that misclassification remains a material consideration for both classes. The error rates are relatively balanced, with no pronounced bias toward either type of misclassification.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:f3c7
2026-04-03 03:06:23,722 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

The Minimum Accuracy test evaluates whether the model's prediction accuracy meets or exceeds a specified threshold, providing a direct measure of overall predictive correctness. The results table presents the model's observed accuracy score, the minimum threshold set for the test, and the corresponding pass/fail outcome. The model's accuracy score is compared against the threshold to determine if the model satisfies the minimum performance criterion.

Key insights:

  • Accuracy score below threshold: The model achieved an accuracy score of 0.6522, which is below the specified threshold of 0.7.
  • Test outcome is Fail: The test result is marked as "Fail," indicating the model did not meet the minimum accuracy requirement established for this evaluation.

The results indicate that the model's predictive accuracy did not reach the predefined minimum threshold, as evidenced by the observed score of 0.6522 against the 0.7 benchmark. This outcome reflects a shortfall in overall prediction correctness relative to the test criterion.

Tables

Score Threshold Pass/Fail
0.6522 0.7 Fail
2026-04-03 03:06:26,755 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

The MinimumF1Score:logreg_champion test evaluates whether the model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents the observed F1 score, the minimum threshold for passing, and the pass/fail outcome. The model's F1 score is reported as 0.6554, with a threshold set at 0.5, and the test outcome is marked as "Pass".

Key insights:

  • F1 score exceeds minimum threshold: The model achieved an F1 score of 0.6554, which is above the required threshold of 0.5.
  • Test outcome is positive: The model passed the test, indicating that its balance between precision and recall meets the established performance criterion.

The results indicate that the model demonstrates balanced classification performance on the validation set, with the F1 score surpassing the minimum required level. This outcome reflects effective identification of positive cases while controlling for false positives and negatives within the defined acceptance criteria.

Tables

Score Threshold Pass/Fail
0.6554 0.5 Pass
2026-04-03 03:06:30,091 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document
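
Since the F1 score balances precision and recall, we can verify that relationship directly with scikit-learn on hypothetical labels (an illustrative sketch, not ValidMind's implementation):

```python
# Illustrative sketch with hypothetical labels: the F1 score is the
# harmonic mean of precision and recall, which is what makes it a
# "balanced" threshold metric.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)
passed = f1 >= 0.5                   # threshold check, as in the test above
```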

ROC Curve Logreg Champion

The ROC Curve test evaluates the binary classification performance of the log_model_champion by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) on the test_dataset_final. The resulting plot displays the trade-off between the true positive rate and false positive rate across all classification thresholds, with the model's ROC curve compared against a baseline representing random classification. The AUC value is provided as a summary metric of the model's discriminative ability.

Key insights:

  • AUC indicates moderate discriminative power: The model achieves an AUC of 0.69, reflecting moderate ability to distinguish between the positive and negative classes.
  • ROC curve consistently above random baseline: The ROC curve remains above the diagonal line representing random performance (AUC = 0.5) across all thresholds, indicating the model provides meaningful separation between classes.
  • No evidence of near-random classification: The curve does not approach the random baseline at any threshold, suggesting the model avoids high-risk performance zones.

The ROC analysis demonstrates that the log_model_champion exhibits moderate classification performance on the test dataset, with an AUC of 0.69 indicating reliable but not strong separation between classes. The model consistently outperforms random guessing, and no thresholds display performance degradation toward randomness.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:462c
2026-04-03 03:06:34,462 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As we can observe from the output above, our champion model fails the MinimumAccuracy test at the default threshold of the out-of-the-box test, so let's log an artifact (finding) in the ValidMind Platform (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation under Documents.

  3. Locate the Data Preparation section and click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics section, locate Artifacts then click Link Artifact to Report:

    Screenshot showing the validation report with the link artifact option highlighted

  5. Select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue to add a validation issue type artifact.

  7. Enter in the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6522, which falls below the required minimum. As a result, the test produced a Fail outcome.
  8. Click Save.

  9. Select the validation issue you just added to link to your validation report and click Update Linked Artifacts to insert your validation issue.

  10. Click on the validation issue to expand the issue, where you can adjust details such as severity, owner, due date, status, etc. as well as include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now conducted tests similar to those performed by the model development team on our champion model, with the aim of verifying their test results.

Next, let's see how our challenger models compare. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger model:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds], "model" : [vm_log_model,vm_rf_model]
        }
    ).log()

Classifier Performance Champion Vs Challenger

The Classifier Performance: champion_vs_challenger test evaluates the predictive performance of classification models using precision, recall, F1-score, accuracy, and ROC AUC metrics. The results table presents these metrics for two models—log_model_champion and rf_model—across both classes, as well as macro and weighted averages. Additional summary metrics for accuracy and ROC AUC are provided for each model, enabling direct comparison of their classification effectiveness.

Key insights:

  • rf_model outperforms log_model_champion across all metrics: rf_model achieves higher precision, recall, and F1-scores for both classes, as well as higher macro and weighted averages.
  • Higher overall accuracy and ROC AUC for rf_model: rf_model records an accuracy of 0.7202 and ROC AUC of 0.79, compared to log_model_champion's accuracy of 0.6522 and ROC AUC of 0.6903.
  • Consistent class-wise performance improvement: For class 0, rf_model shows a precision of 0.6937 and recall of 0.7452, while for class 1, precision is 0.7484 and recall is 0.6973, both exceeding the corresponding values for log_model_champion.

The results indicate that rf_model demonstrates superior classification performance relative to log_model_champion, as evidenced by higher precision, recall, F1-scores, accuracy, and ROC AUC values. The performance improvement is consistent across both classes and aggregate metrics, highlighting rf_model's stronger predictive capability in this evaluation.

Tables

model Class Precision Recall F1
log_model_champion 0 0.6284 0.6710 0.6490
log_model_champion 1 0.6772 0.6350 0.6554
log_model_champion Weighted Average 0.6538 0.6522 0.6523
log_model_champion Macro Average 0.6528 0.6530 0.6522
rf_model 0 0.6937 0.7452 0.7185
rf_model 1 0.7484 0.6973 0.7220
rf_model Weighted Average 0.7222 0.7202 0.7203
rf_model Macro Average 0.7211 0.7212 0.7202
model Metric Value
log_model_champion Accuracy 0.6522
log_model_champion ROC AUC 0.6903
rf_model Accuracy 0.7202
rf_model ROC AUC 0.7900
2026-04-03 03:06:38,494 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document
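
The per-class rows and the macro versus weighted averages in the table above can be reproduced in principle with scikit-learn's precision_recall_fscore_support; the labels below are hypothetical toy data, not the churn dataset:

```python
# Illustrative sketch with toy labels (not the churn data): how the
# per-class rows and the macro vs. weighted averages are computed.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

# Per-class precision, recall, F1 (arrays indexed by class)
per_class = precision_recall_fscore_support(y_true, y_pred, average=None)
# Macro average: unweighted mean over classes
macro_f1 = precision_recall_fscore_support(y_true, y_pred, average="macro")[2]
# Weighted average: mean over classes weighted by class support
weighted_f1 = precision_recall_fscore_support(y_true, y_pred, average="weighted")[2]
```

With imbalanced classes the weighted average tracks the majority class more closely, which is why both averages are reported.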

Confusion Matrix Champion Vs Challenger

The Confusion Matrix: champion_vs_challenger test evaluates the predictive performance of classification models by comparing predicted and actual class labels, quantifying the counts of true positives, true negatives, false positives, and false negatives. The results are presented as annotated heatmaps for both the champion (log_model_champion) and challenger (rf_model) models, enabling direct comparison of classification outcomes. Each matrix cell displays the count of predictions for each outcome type, providing a detailed breakdown of model performance across both positive and negative classes.

Key insights:

  • Challenger model reduces classification errors: The challenger model (rf_model) records fewer false positives (79 vs. 102) and false negatives (102 vs. 123) compared to the champion model (log_model_champion).
  • Higher correct classification rates in challenger: The challenger model achieves higher true positives (235 vs. 214) and true negatives (231 vs. 208) than the champion model.
  • Improved balance between sensitivity and specificity: The challenger model demonstrates a more favorable distribution between correctly and incorrectly classified instances across both classes.

The confusion matrix results indicate that the challenger model outperforms the champion model in both correctly identifying positive and negative cases, as evidenced by higher true positive and true negative counts and lower false positive and false negative counts. This reflects an overall improvement in classification accuracy and error reduction for the challenger model relative to the champion.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:3f0b
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:37e1
2026-04-03 03:06:42,769 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document
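
The four counts behind each heatmap cell follow the standard scikit-learn confusion matrix layout; here's a small sketch with hypothetical labels:

```python
# Illustrative sketch with hypothetical labels: the four counts behind
# each confusion matrix heatmap cell.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```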

❌ Minimum Accuracy Champion Vs Challenger

The Minimum Accuracy test evaluates whether each model's prediction accuracy meets or exceeds a specified threshold, providing a direct measure of overall classification correctness. The results table presents accuracy scores for two models—log_model_champion and rf_model—compared against a threshold of 0.7, with pass/fail outcomes indicated for each. Accuracy is calculated as the proportion of correct predictions to total predictions, offering a holistic view of model performance on the tested dataset.

Key insights:

  • rf_model surpasses accuracy threshold: rf_model achieved an accuracy score of 0.7202, exceeding the 0.7 threshold and receiving a "Pass" outcome.
  • log_model_champion falls below threshold: log_model_champion recorded an accuracy score of 0.6522, which is below the 0.7 threshold, resulting in a "Fail" outcome.
  • Clear differentiation in model performance: The two models display a notable difference in accuracy, with rf_model outperforming log_model_champion by approximately 6.8 percentage points.

The results indicate that rf_model meets the minimum accuracy requirement, while log_model_champion does not reach the specified threshold. This differentiation highlights a material performance gap between the models, with rf_model demonstrating higher predictive accuracy on the evaluated dataset.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6522 0.7 Fail
rf_model 0.7202 0.7 Pass
2026-04-03 03:06:46,616 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

The MinimumF1Score:champion_vs_challenger test evaluates whether each model's F1 score on the validation set meets or exceeds a predefined minimum threshold, ensuring balanced performance between precision and recall. The results table presents F1 scores for both the champion and challenger models, alongside the minimum threshold and pass/fail status. Both models are assessed independently, with their respective F1 scores compared directly to the threshold value.

Key insights:

  • Both models exceed the minimum F1 threshold: The log_model_champion achieved an F1 score of 0.6554 and the rf_model achieved 0.722, both surpassing the minimum threshold of 0.5.
  • Consistent pass status across models: Both models are marked as "Pass," indicating that neither model exhibited performance below the required F1 score standard.
  • rf_model demonstrates higher F1 performance: The rf_model outperforms the log_model_champion by a margin of 0.0666 in F1 score, reflecting stronger balance between precision and recall on the validation set.

Both the champion and challenger models satisfy the minimum F1 score requirement, indicating balanced classification performance on the validation dataset. The rf_model demonstrates a higher F1 score relative to the log_model_champion, suggesting improved effectiveness in managing the trade-off between precision and recall. No model exhibited F1 performance below the established threshold.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6554 0.5 Pass
rf_model 0.7220 0.5 Pass
2026-04-03 03:06:50,393 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

The ROC Curve test evaluates the discrimination ability of binary classification models by plotting the True Positive Rate against the False Positive Rate at various thresholds and calculating the Area Under the Curve (AUC) score. The results display ROC curves and corresponding AUC values for two models—log_model_champion and rf_model—on the test_dataset_final, with each curve compared against a baseline representing random classification (AUC = 0.5). The visualizations and AUC metrics provide a direct comparison of each model’s ability to distinguish between the positive and negative classes.

Key insights:

  • rf_model demonstrates higher discriminative power: The rf_model achieves an AUC of 0.79, indicating stronger separation between classes compared to the log_model_champion.
  • log_model_champion shows moderate performance: The log_model_champion records an AUC of 0.69, reflecting moderate discriminative ability above random chance but below that of the rf_model.
  • Both models outperform random classification: Both ROC curves are consistently above the diagonal line representing random performance (AUC = 0.5), confirming that each model provides meaningful predictive value on the test dataset.

The comparative ROC analysis reveals that both models provide measurable discrimination between classes, with the rf_model exhibiting notably higher classification performance than the log_model_champion. The AUC values substantiate that both models surpass random prediction, with the rf_model offering a more robust solution for binary classification tasks in this context.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:3a1d
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:b3f8
2026-04-03 03:06:55,231 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model passes the MinimumAccuracy where our champion did not.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate using the challenger model by inserting the performance tests we logged with this notebook into the appropriate section.

Run diagnostic tests

Next, we want to compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available diagnostic tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.OverfitDiagnosis Overfit Diagnosis Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... True True ['model', 'datasets'] {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] ['classification', 'regression']
validmind.model_validation.sklearn.RobustnessDiagnosis Robustness Diagnosis Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... True True ['datasets', 'model'] {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} ['sklearn', 'model_diagnosis', 'visualization'] ['classification', 'regression']
validmind.model_validation.sklearn.WeakspotsDiagnosis Weakspots Diagnosis Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... True True ['datasets', 'model'] {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] ['classification', 'text_classification']

Let's now use the OverfitDiagnosis test to assess the models for potential signs of overfitting and to identify any sub-segments where performance may be inconsistent.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations. This results in excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
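
Conceptually, the per-segment check the OverfitDiagnosis test performs resembles the following sketch, where a segment's train-test AUC gap is flagged against a cutoff (the function and data below are hypothetical, not ValidMind's implementation):

```python
# Illustrative sketch (not ValidMind's implementation): flag a data
# segment whose train-test AUC gap exceeds the diagnostic cutoff.
from sklearn.metrics import roc_auc_score

def overfit_gap(y_train, p_train, y_test, p_test, cut_off_threshold=0.04):
    """Return the train-test AUC gap and whether it exceeds the cutoff."""
    gap = roc_auc_score(y_train, p_train) - roc_auc_score(y_test, p_test)
    return gap, gap > cut_off_threshold

# A memorized segment: perfect ranking on train, weaker ranking on test
gap, flagged = overfit_gap(
    y_train=[0, 0, 1, 1], p_train=[0.1, 0.2, 0.8, 0.9],  # train AUC = 1.0
    y_test=[0, 1, 0, 1], p_test=[0.6, 0.4, 0.2, 0.9],    # test AUC = 0.75
)
```
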
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds,vm_test_ds]],
        "model" : [vm_log_model,vm_rf_model]
    }
).log()

Overfit Diagnosis Champion Vs Challenger

The Overfit Diagnosis: champion_vs_challenger test evaluates the extent to which model performance differs between training and test sets across feature segments, using AUC as the performance metric for classification. The results are presented as AUC gaps for each feature bin, with a threshold of 0.04 used to flag regions of potential overfitting. Both tabular and visual outputs are provided for the champion (logistic regression) and challenger (random forest) models, highlighting feature segments where the performance gap exceeds the threshold.

Key insights:

  • Widespread overfitting in challenger (random forest) model: The random forest model exhibits substantial AUC gaps across nearly all feature segments, with gaps frequently exceeding 0.2 and reaching as high as 0.8788 for NumOfProducts and 0.4018 for Balance. All monitored features show AUC gaps well above the 0.04 threshold.
  • Localized overfitting in champion (logistic regression) model: The logistic regression model shows moderate AUC gaps in specific feature bins, with the largest gaps observed for CreditScore (0.1593), Tenure (0.1444), and EstimatedSalary (0.1068). Most other feature segments remain below or near the threshold.
  • Feature-specific risk patterns: For the random forest model, overfitting is present across all major features, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Geography, and Gender. In contrast, the logistic regression model's overfitting is more isolated, with only a few bins per feature exceeding the threshold.
  • Magnitude and direction of AUC gaps: The random forest model consistently achieves perfect AUC (1.0) on training data, while test AUCs are substantially lower, indicating classic overfitting. The logistic regression model shows smaller, but still notable, discrepancies between training and test AUCs in certain bins.

The results indicate that the random forest challenger model demonstrates pervasive overfitting across all evaluated feature segments, with AUC gaps far exceeding the diagnostic threshold. The logistic regression champion model displays more controlled generalization, with overfitting limited to specific bins within a subset of features. These findings highlight a clear distinction in generalization performance between the two models, with the random forest model exhibiting high risk of overfitting and the logistic regression model maintaining more stable out-of-sample behavior except in a few localized regions.

Tables

model Feature Slice Number of Training Records Number of Test Records Training AUC Test AUC Gap
log_model_champion CreditScore (750.0, 800.0] 235 57 0.7399 0.5806 0.1593
log_model_champion Tenure (2.0, 3.0] 297 56 0.6941 0.5497 0.1444
log_model_champion Tenure (7.0, 8.0] 255 72 0.7472 0.6688 0.0785
log_model_champion Balance (100359.236, 125449.045] 608 162 0.7162 0.6677 0.0485
log_model_champion Balance (150538.854, 175628.663] 192 47 0.6082 0.5527 0.0555
log_model_champion Balance (200718.472, 225808.281] 18 2 0.0444 0.0000 0.0444
log_model_champion NumOfProducts (2.8, 3.1] 151 34 0.9456 0.8485 0.0971
log_model_champion EstimatedSalary (19996.697, 39981.814] 237 66 0.6579 0.6144 0.0434
log_model_champion EstimatedSalary (59966.931, 79952.048] 287 74 0.6585 0.5975 0.0610
log_model_champion EstimatedSalary (139907.399, 159892.516] 250 67 0.7182 0.6114 0.1068
log_model_champion EstimatedSalary (179877.633, 199862.75] 269 62 0.7164 0.6618 0.0546
rf_model CreditScore (450.0, 500.0] 110 33 1.0000 0.7575 0.2425
rf_model CreditScore (500.0, 550.0] 267 62 1.0000 0.7741 0.2259
rf_model CreditScore (550.0, 600.0] 386 95 1.0000 0.7708 0.2292
rf_model CreditScore (600.0, 650.0] 466 123 1.0000 0.8031 0.1969
rf_model CreditScore (650.0, 700.0] 498 130 1.0000 0.8076 0.1924
rf_model CreditScore (700.0, 750.0] 404 98 1.0000 0.7874 0.2126
rf_model CreditScore (750.0, 800.0] 235 57 1.0000 0.7590 0.2410
rf_model CreditScore (800.0, 850.0] 153 38 1.0000 0.7986 0.2014
rf_model Tenure (-0.01, 1.0] 354 104 1.0000 0.7512 0.2488
rf_model Tenure (1.0, 2.0] 254 77 1.0000 0.7120 0.2880
rf_model Tenure (2.0, 3.0] 297 56 1.0000 0.6574 0.3426
rf_model Tenure (3.0, 4.0] 264 67 1.0000 0.8089 0.1911
rf_model Tenure (4.0, 5.0] 259 73 1.0000 0.8333 0.1667
rf_model Tenure (5.0, 6.0] 249 56 1.0000 0.8450 0.1550
rf_model Tenure (6.0, 7.0] 265 58 1.0000 0.8768 0.1232
rf_model Tenure (7.0, 8.0] 255 72 1.0000 0.7227 0.2773
rf_model Tenure (8.0, 9.0] 257 59 1.0000 0.9115 0.0885
rf_model Tenure (9.0, 10.0] 131 25 1.0000 0.8160 0.1840
rf_model Balance (-250.898, 25089.809] 808 210 1.0000 0.8622 0.1378
rf_model Balance (50179.618, 75269.427] 101 18 1.0000 0.6944 0.3056
rf_model Balance (75269.427, 100359.236] 310 66 1.0000 0.7045 0.2955
rf_model Balance (100359.236, 125449.045] 608 162 1.0000 0.7097 0.2903
rf_model Balance (125449.045, 150538.854] 477 121 1.0000 0.8140 0.1860
rf_model Balance (150538.854, 175628.663] 192 47 1.0000 0.5982 0.4018
rf_model Balance (175628.663, 200718.472] 45 16 1.0000 0.7540 0.2460
rf_model NumOfProducts (0.997, 1.3] 1478 370 1.0000 0.6983 0.3017
rf_model NumOfProducts (1.9, 2.2] 918 237 1.0000 0.6747 0.3253
rf_model NumOfProducts (2.8, 3.1] 151 34 1.0000 0.1212 0.8788
rf_model HasCrCard (-0.001, 0.1] 808 174 1.0000 0.7847 0.2153
rf_model HasCrCard (0.9, 1.0] 1777 473 1.0000 0.7920 0.2080
rf_model IsActiveMember (-0.001, 0.1] 1363 347 1.0000 0.7724 0.2276
rf_model IsActiveMember (0.9, 1.0] 1222 300 1.0000 0.7716 0.2284
rf_model EstimatedSalary (-188.271, 19996.697] 228 75 1.0000 0.8448 0.1552
rf_model EstimatedSalary (19996.697, 39981.814] 237 66 1.0000 0.7156 0.2844
rf_model EstimatedSalary (39981.814, 59966.931] 253 59 1.0000 0.7733 0.2267
rf_model EstimatedSalary (59966.931, 79952.048] 287 74 1.0000 0.8156 0.1844
rf_model EstimatedSalary (79952.048, 99937.165] 256 63 1.0000 0.7637 0.2363
rf_model EstimatedSalary (99937.165, 119922.282] 264 58 1.0000 0.8133 0.1867
rf_model EstimatedSalary (119922.282, 139907.399] 274 56 1.0000 0.8545 0.1455
rf_model EstimatedSalary (139907.399, 159892.516] 250 67 1.0000 0.7233 0.2767
rf_model EstimatedSalary (159892.516, 179877.633] 267 67 1.0000 0.7845 0.2155
rf_model EstimatedSalary (179877.633, 199862.75] 269 62 1.0000 0.8755 0.1245
rf_model Geography_Germany (-0.001, 0.1] 1783 434 1.0000 0.7711 0.2289
rf_model Geography_Germany (0.9, 1.0] 802 213 1.0000 0.7842 0.2158
rf_model Geography_Spain (-0.001, 0.1] 1988 500 1.0000 0.7895 0.2105
rf_model Geography_Spain (0.9, 1.0] 597 147 1.0000 0.7858 0.2142
rf_model Gender_Male (-0.001, 0.1] 1278 299 1.0000 0.7768 0.2232
rf_model Gender_Male (0.9, 1.0] 1307 348 1.0000 0.7902 0.2098

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9786
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e463
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:edf0
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:91d6
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6441
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9fc9
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c3ca
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:8758
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:16b5
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:3c01
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c0ef
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f22b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:bbc5
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:93dd
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b095
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2f84
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9125
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e49c
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1196
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:fe4d
2026-04-03 03:07:13,396 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test. Robustness refers to a model's ability to maintain consistent performance when its inputs are perturbed, and stability refers to a model's ability to produce consistent outputs over time across different data subsets.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:
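
As we understand it, the test perturbs numeric features with zero-mean Gaussian noise scaled by each column's standard deviation before re-scoring the model; a minimal sketch of that scheme on synthetic data (the perturb helper is hypothetical, not ValidMind's code):

```python
# Sketch of the perturbation scheme as we understand it: add zero-mean
# Gaussian noise scaled by each numeric column's standard deviation.
# The perturb helper and data are hypothetical.
import numpy as np
import pandas as pd

def perturb(df, scaling_factor, rng):
    """Return a copy of df with scaled Gaussian noise on numeric columns."""
    noisy = df.copy()
    for col in noisy.select_dtypes(include="number").columns:
        noise = rng.normal(0.0, scaling_factor * noisy[col].std(), len(noisy))
        noisy[col] = noisy[col] + noise
    return noisy

rng = np.random.default_rng(0)
df = pd.DataFrame({"Balance": [0.0, 100.0, 200.0, 300.0]})
noisy = perturb(df, scaling_factor=0.1, rng=rng)
# A scaling_factor of 0.0 leaves the data unchanged (the baseline row)
```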

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds,vm_test_ds]],
        "model" : [vm_log_model,vm_rf_model]
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

The Robustness Diagnosis: Champion vs. LogRegression test evaluates the resilience of the log_model_champion and rf_model to input perturbations by introducing Gaussian noise to numeric features and measuring AUC decay across increasing noise levels. Results are presented for both train and test datasets, with AUC and performance decay tracked at each perturbation scale. The visualizations and tabular data provide a comparative view of how each model's predictive performance responds to incremental noise, highlighting differences in robustness and sensitivity.

Key insights:

  • Logistic regression model demonstrates stable robustness: The log_model_champion maintains consistent AUC values on both train (0.6801 to 0.6663) and test (0.6903 to 0.6755) datasets as perturbation size increases from 0.0 to 0.5, with performance decay remaining below 0.015 on test data and all test points passing the threshold.
  • Random forest model exhibits pronounced performance decay: The rf_model shows substantial AUC reduction on the train dataset (from 1.0 to 0.8051) and moderate decay on the test dataset (from 0.79 to 0.7033) as noise increases, with performance decay reaching 0.1949 on train and 0.0867 on test at the highest perturbation.
  • Threshold failures concentrated in random forest model: The rf_model fails the robustness threshold on the train dataset at perturbation sizes 0.2 and above, and on the test dataset at the highest perturbation (0.5), while the log_model_champion passes all thresholds across both datasets.
  • Performance decay is more gradual in logistic regression: The log_model_champion displays minimal and gradual AUC decline, with no abrupt drops or negative performance decay values, indicating consistent behavior under noise.

The results indicate that the log_model_champion exhibits strong robustness to Gaussian noise, with minimal AUC degradation and consistent threshold passing across all tested perturbation levels. In contrast, the rf_model is more sensitive to input noise, particularly on the train dataset, where performance decay is substantial and threshold failures occur at moderate to high perturbation sizes. The comparative analysis highlights the greater resilience of the logistic regression model to noisy input conditions relative to the random forest model in this context.

Tables

| Model | Perturbation Size | Dataset | Row Count | AUC | Performance Decay | Passed |
|---|---|---|---|---|---|---|
| log_model_champion | Baseline (0.0) | train_dataset_final | 2585 | 0.6801 | 0.0000 | True |
| log_model_champion | Baseline (0.0) | test_dataset_final | 647 | 0.6903 | 0.0000 | True |
| log_model_champion | 0.1 | train_dataset_final | 2585 | 0.6792 | 0.0009 | True |
| log_model_champion | 0.1 | test_dataset_final | 647 | 0.6878 | 0.0025 | True |
| log_model_champion | 0.2 | train_dataset_final | 2585 | 0.6802 | -0.0001 | True |
| log_model_champion | 0.2 | test_dataset_final | 647 | 0.6929 | -0.0026 | True |
| log_model_champion | 0.3 | train_dataset_final | 2585 | 0.6736 | 0.0065 | True |
| log_model_champion | 0.3 | test_dataset_final | 647 | 0.6828 | 0.0075 | True |
| log_model_champion | 0.4 | train_dataset_final | 2585 | 0.6710 | 0.0091 | True |
| log_model_champion | 0.4 | test_dataset_final | 647 | 0.6782 | 0.0122 | True |
| log_model_champion | 0.5 | train_dataset_final | 2585 | 0.6663 | 0.0138 | True |
| log_model_champion | 0.5 | test_dataset_final | 647 | 0.6755 | 0.0148 | True |
| rf_model | Baseline (0.0) | train_dataset_final | 2585 | 1.0000 | 0.0000 | True |
| rf_model | Baseline (0.0) | test_dataset_final | 647 | 0.7900 | 0.0000 | True |
| rf_model | 0.1 | train_dataset_final | 2585 | 0.9843 | 0.0157 | True |
| rf_model | 0.1 | test_dataset_final | 647 | 0.7844 | 0.0056 | True |
| rf_model | 0.2 | train_dataset_final | 2585 | 0.9434 | 0.0566 | False |
| rf_model | 0.2 | test_dataset_final | 647 | 0.7930 | -0.0029 | True |
| rf_model | 0.3 | train_dataset_final | 2585 | 0.8980 | 0.1020 | False |
| rf_model | 0.3 | test_dataset_final | 647 | 0.7676 | 0.0224 | True |
| rf_model | 0.4 | train_dataset_final | 2585 | 0.8573 | 0.1427 | False |
| rf_model | 0.4 | test_dataset_final | 647 | 0.7687 | 0.0214 | True |
| rf_model | 0.5 | train_dataset_final | 2585 | 0.8051 | 0.1949 | False |
| rf_model | 0.5 | test_dataset_final | 647 | 0.7033 | 0.0867 | False |

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:77bc
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:b848

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare our champion and challenger models to see whether one offers more interpretable or logical feature importance scores.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here, to provide a realistic, unseen sample that mimics future or production data, as the training dataset has already influenced our model during learning:

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features Champion Vs Challenger

The FeaturesAUC:champion_vs_challenger test evaluates the univariate discriminatory power of each feature by calculating the Area Under the Curve (AUC) for each feature against the binary target. The resulting bar chart displays the AUC scores for individual features, providing a direct measure of each feature's ability to distinguish between classes when considered in isolation. Higher AUC values indicate stronger univariate predictive power, while lower values suggest limited individual contribution to class separation.

Key insights:

  • Geography_Germany and Balance show highest univariate AUC: Geography_Germany and Balance achieve the highest AUC scores, both approaching 0.58, indicating these features have the strongest individual discriminatory power among those evaluated.
  • HasCrCard, Tenure, and EstimatedSalary display moderate AUC: These features exhibit AUC values around 0.50, reflecting moderate ability to differentiate between classes on their own.
  • IsActiveMember and NumOfProducts have lowest AUC: These features register the lowest AUC scores, both below 0.40, indicating limited univariate predictive strength in the current dataset.

The results indicate that Geography_Germany and Balance are the most individually informative features for binary class separation in this dataset, while several other features contribute moderate to low univariate discriminatory power. The observed spread in AUC values highlights varying levels of individual feature relevance, with some features providing minimal standalone predictive value.
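The univariate AUC idea behind this test can be sketched directly: treat each raw feature as if it were a classifier's score for the binary target, and compute its ROC AUC. The example below uses synthetic data with placeholder column names, not the notebook's churn dataset, and is not the FeaturesAUC implementation itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the notebook's dataset
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

# AUC of each feature used alone as a score for the target
feature_auc = {col: roc_auc_score(y, df[col]) for col in df.columns}

for col, auc in sorted(feature_auc.items(), key=lambda kv: -kv[1]):
    print(f"{col}: AUC={auc:.3f}")
```

Note that a value well below 0.5 indicates a feature that is negatively associated with the target, which still carries discriminatory signal in the opposite direction.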

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:d850
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:06e2

Permutation Feature Importance Champion Vs Challenger

The Permutation Feature Importance (PFI) test evaluates the relative importance of each input feature by measuring the decrease in model performance when feature values are randomly permuted. The results are presented as bar plots for both the champion (logistic regression) and challenger (random forest) models, with each bar representing the importance score for a given feature. The plots allow for direct comparison of feature reliance between the two model types.

Key insights:

  • Distinct feature reliance between models: The champion model (logistic regression) assigns highest importance to IsActiveMember, Geography_Germany, and Gender_Male, while the challenger model (random forest) ranks NumOfProducts, Balance, and Geography_Germany as most important.
  • NumOfProducts critical for challenger, minimal for champion: NumOfProducts is the most influential feature in the challenger model but has negligible importance in the champion model.
  • IsActiveMember and Geography_Germany consistently important: Both models assign substantial importance to IsActiveMember and Geography_Germany, though the magnitude and ranking differ.
  • Broader feature spread in challenger model: The challenger model distributes importance across a wider set of features, with several variables (Balance, Tenure, EstimatedSalary) showing moderate importance, whereas the champion model's importance is concentrated in fewer features.

The PFI results indicate that the champion and challenger models rely on different subsets of features to drive predictions, with the challenger model exhibiting a more distributed feature importance profile. Certain features, such as IsActiveMember and Geography_Germany, are influential in both models, while others, notably NumOfProducts, show model-specific significance. This divergence in feature reliance highlights differences in model structure and potential sensitivity to input variables.
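For reference, the same technique is available directly in scikit-learn. This is an illustrative sketch on synthetic data, not the ValidMind test: each feature is shuffled in turn on a held-out set, and the drop in AUC measures how much the model relies on it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's model and datasets
X, y = make_classification(n_samples=800, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Mean drop in AUC when each feature's values are shuffled on the test set
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

Computing PFI on held-out data, as here and in the test above, avoids rewarding features the model has merely memorized from training.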

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:7849
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:738d

SHAP Global Importance Champion Vs Challenger

The SHAPGlobalImportance:champion_vs_challenger test evaluates and visualizes the global feature importance for both the champion (log_model_champion) and challenger (rf_model) models using SHAP values. The results include normalized mean importance plots and SHAP summary plots, which display the relative contribution of each feature to model predictions and the distribution of SHAP values across instances. These visualizations facilitate comparison of feature influence and model reasoning between the two models.

Key insights:

  • Distinct feature dominance in champion model: The log_model_champion model assigns the highest normalized SHAP importance to IsActiveMember, Geography_Germany, and Gender_Male, with IsActiveMember reaching the maximum normalized value.
  • Broader feature utilization in champion model: The champion model distributes importance across a wider set of features, including Balance, CreditScore, Tenure, NumOfProducts, EstimatedSalary, HasCrCard, and Geography_Spain, though with lower relative importance.
  • Challenger model focuses on fewer features: The rf_model challenger model shows SHAP importance concentrated almost exclusively on Tenure and CreditScore, with minimal or no contribution from other features.
  • SHAP value distributions indicate model reasoning: The summary plots for the champion model display a range of SHAP value impacts for top features, with visible variation and both positive and negative contributions, while the challenger model’s SHAP distributions are tightly clustered around Tenure and CreditScore.
  • No evidence of anomalous or illogical feature importance: All features with high SHAP values in both models are consistent with typical input variables for attrition/churn management.

The SHAP global importance analysis reveals that the champion model leverages a broader set of features with a clear dominance by IsActiveMember, Geography_Germany, and Gender_Male, while the challenger model relies almost exclusively on Tenure and CreditScore. The distribution of SHAP values in the summary plots indicates that the champion model incorporates more nuanced feature interactions, whereas the challenger model’s predictions are driven by a narrower feature set. No irregularities or unexpected feature importances are observed in either model.
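To make the "normalized mean importance" reading concrete, here is a minimal sketch for the linear (champion-style) case only, on synthetic data. Under a feature-independence assumption, the SHAP value of feature j for instance i in a linear model reduces to w_j * (x_ij - mean(x_j)), so global importance can be computed as the mean absolute SHAP value per feature without the shap library. This is an illustration of the concept, not ValidMind's SHAPGlobalImportance implementation, and it does not cover tree models like the challenger.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the champion logistic regression
X, y = make_classification(n_samples=600, n_features=5, random_state=2)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Exact SHAP values for a linear model, assuming independent features:
# contribution of feature j is its weight times its deviation from the mean
shap_values = model.coef_[0] * (X - X.mean(axis=0))

# Global importance: mean |SHAP| per feature, normalized so the top feature is 1.0
global_importance = np.abs(shap_values).mean(axis=0)
normalized = global_importance / global_importance.max()
for i in np.argsort(normalized)[::-1]:
    print(f"feature_{i}: normalized importance {normalized[i]:.3f}")
```

For tree ensembles like the rf_model challenger, exact SHAP values instead require an algorithm such as TreeSHAP, which the shap library provides.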

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:2a7d
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:3128
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:a088
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:2e1b

In summary

In this third notebook, you learned how to:

  • Develop a potential challenger model to compare against your champion
  • Initialize the challenger model and assign its predictions with the ValidMind Library
  • Run and log comparative tests, including robustness and feature importance tests, for your champion and challenger models

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial