ValidMind for validation 3 — Developing a potential challenger

Learn how to use ValidMind for your end-to-end validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger and then pass your challenger and its predictions to ValidMind.

A challenger is an alternate record (model) that attempts to outperform the champion, ensuring that the best performing fit-for-purpose record is always considered for deployment. Challengers also help avoid over-reliance on a single record, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform: Validator Fundamentals

Prerequisites

In order to develop potential challengers with this notebook, you'll need to first have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you — as we performed the same actions in the previous notebook, 2 — Start the validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. On the left sidebar that appears for your model, select Getting Started and select Validation from the DOCUMENT drop-down menu.

  2. Click Copy snippet to clipboard.

  3. Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="validation-report",
)
Note: you may need to restart the kernel to use updated packages.
2026-05-26 22:11:46,905 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report
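Before calling vm.init(), it can help to confirm that the credentials were actually picked up from the .env file. The following is a minimal sketch and not part of the ValidMind Library: the helper missing_credentials is hypothetical, and the variable names VM_API_HOST, VM_API_KEY, VM_API_SECRET, and VM_API_MODEL are assumptions, so match them to the names used in your own copied snippet.

```python
import os

# Assumed variable names; adjust them to match your own .env file
EXPECTED_VARS = ("VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_API_MODEL")


def missing_credentials(env=None, expected=EXPECTED_VARS):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in expected if not env.get(name)]


# Example with an in-memory mapping instead of the real environment
example_env = {"VM_API_HOST": "https://example.invalid", "VM_API_KEY": "abc"}
print(missing_credentials(example_env))
```

Calling missing_credentials() with no arguments checks the real process environment, which is what you would do after running %dotenv.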

Import the sample dataset

Next, we'll load the sample Bank Customer Churn Prediction dataset that was used to develop the champion, which we will preprocess independently:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
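The rebalancing step above is a form of random undersampling. As a rough, self-contained illustration of the same idea, here is a sketch on a toy DataFrame. The helper undersample and the toy data are hypothetical, and the fixed random_state is an addition for reproducibility that the notebook cell does not use everywhere.

```python
import pandas as pd

# Toy data: 3 churned customers (1) and 7 retained customers (0)
toy_df = pd.DataFrame({
    "Balance": range(10),
    "Exited": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})


def undersample(df, target, random_state=42):
    """Downsample every class to the size of the smallest class."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are interleaved rather than stacked
    return pd.concat(parts).sample(frac=1, random_state=random_state)


balanced_toy_df = undersample(toy_df, "Exited")
print(balanced_toy_df["Exited"].value_counts().to_dict())
```

Undersampling discards rows from the majority class, so it trades data volume for class balance. That trade-off is usually acceptable for a quick challenger, but it is worth noting in the validation report.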

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before we can run tests you’ll need to initialize a ValidMind dataset object with the init_dataset function:

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among dataset features to identify potentially redundant or strongly associated variables using a correlation threshold of 0.3. The results table lists the top 10 feature pairs by Pearson correlation coefficient, showing each pair’s coefficient and corresponding Pass/Fail outcome. Observed coefficients range from -0.2113 to 0.364, with one pair exceeding the configured threshold and the remaining reported pairs falling below it.

Key insights:

  • One pair exceeds threshold: The pair (Age, Exited) has the highest reported correlation at 0.364 and is the only entry marked Fail, indicating it exceeds the configured threshold of 0.3.
  • Remaining reported correlations are modest: All other listed feature pairs are marked Pass, with absolute correlation values no greater than 0.2113. This includes (IsActiveMember, Exited) at -0.2113 and (Balance, NumOfProducts) at -0.1773.
  • Both positive and negative relationships appear: The reported coefficients include positive values such as (Balance, Exited) = 0.1444 and negative values such as (NumOfProducts, Exited) = -0.0505, indicating mixed linear association directions across the listed pairs.
  • Most listed associations are weak in magnitude: Aside from (Age, Exited), the remaining reported coefficients cluster close to zero, with several values between approximately -0.05 and 0.05, including (Age, Balance) = 0.0539, (NumOfProducts, IsActiveMember) = 0.0524, and (Tenure, IsActiveMember) = -0.0411.

The reported correlation structure is limited, with only one feature pair, (Age, Exited), crossing the test threshold and all other listed pairs remaining below it. Among the top 10 reported relationships, the largest non-failing absolute correlation is 0.2113, and several coefficients are near zero, indicating limited linear association across most listed pairs. Overall, the result identifies a single above-threshold relationship within the reported set while the remaining observed pairwise correlations are comparatively small.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3640 Fail
(IsActiveMember, Exited) -0.2113 Pass
(Balance, NumOfProducts) -0.1773 Pass
(Balance, Exited) 0.1444 Pass
(Age, Balance) 0.0539 Pass
(NumOfProducts, IsActiveMember) 0.0524 Pass
(Age, NumOfProducts) -0.0518 Pass
(NumOfProducts, Exited) -0.0505 Pass
(CreditScore, Exited) -0.0497 Pass
(Tenure, IsActiveMember) -0.0411 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3640 Fail
1 (IsActiveMember, Exited) -0.2113 Pass
2 (Balance, NumOfProducts) -0.1773 Pass
3 (Balance, Exited) 0.1444 Pass
4 (Age, Balance) 0.0539 Pass
5 (NumOfProducts, IsActiveMember) 0.0524 Pass
6 (Age, NumOfProducts) -0.0518 Pass
7 (NumOfProducts, Exited) -0.0505 Pass
8 (CreditScore, Exited) -0.0497 Pass
9 (Tenure, IsActiveMember) -0.0411 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
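The extraction above keeps the first name in each failing pair, which works here because the target column happens to be listed second. If the order could vary, a slightly more defensive parse may be useful. This is only a sketch: the helper features_to_drop is hypothetical, and dropping the second feature when neither name is the target is an arbitrary choice made for illustration.

```python
def features_to_drop(pairs, target):
    """Parse '(A, B)' strings and return feature names to drop, never the target."""
    to_drop = []
    for pair in pairs:
        names = [name.strip() for name in pair.strip("()").split(",")]
        # Exclude the target; if both names are features, take the second one
        candidates = [name for name in names if name != target]
        drop = candidates[-1]
        if drop not in to_drop:
            to_drop.append(drop)
    return to_drop


print(features_to_drop(["(Age, Exited)"], target="Exited"))
print(features_to_drop(["(Exited, Age)", "(Balance, EstimatedSalary)"], target="Exited"))
```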

We can then re-initialize the dataset with a different input_id and with the highly correlated features removed, then re-run the test for confirmation:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among dataset features to identify potentially redundant variables or multicollinearity. The result table reports the top feature pairs ranked by Pearson correlation coefficient, along with pass/fail status based on the configured absolute correlation threshold of 0.3. In this run, the reported coefficients range from -0.2113 to 0.1444, and all listed feature pairs are marked as Pass. The table includes both positive and negative relationships among predictors and the target variable Exited.

Key insights:

  • No correlations exceed threshold: All reported absolute correlation coefficients are below the 0.3 threshold. Every listed feature pair receives a Pass result under the test configuration.
  • Strongest observed relationship is modest: The largest absolute correlation in the reported output is between IsActiveMember and Exited at -0.2113. This is the closest relationship to the threshold, but it remains below the fail criterion.
  • Top feature relationships are weak: The next largest reported correlations are Balance with NumOfProducts at -0.1773 and Balance with Exited at 0.1444. Remaining listed coefficients are all close to zero, with absolute values at or below 0.0524.
  • Both positive and negative associations appear: The reported correlations include negative values such as CreditScore with Exited (-0.0497) and positive values such as Balance with EstimatedSalary (0.0352). The direction of relationships varies, but magnitudes remain small throughout the reported top pairs.

The reported correlation structure shows no feature pairs breaching the configured high-correlation threshold. The strongest observed relationship is modest in magnitude, and the remaining listed coefficients are weak to near-zero. Based on the reported top correlations, the test output does not indicate pronounced linear redundancy among the displayed feature pairs.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(IsActiveMember, Exited) -0.2113 Pass
(Balance, NumOfProducts) -0.1773 Pass
(Balance, Exited) 0.1444 Pass
(NumOfProducts, IsActiveMember) 0.0524 Pass
(NumOfProducts, Exited) -0.0505 Pass
(CreditScore, Exited) -0.0497 Pass
(Tenure, IsActiveMember) -0.0411 Pass
(Balance, EstimatedSalary) 0.0352 Pass
(Balance, HasCrCard) -0.0350 Pass
(CreditScore, EstimatedSalary) -0.0286 Pass
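To make the threshold check more concrete, here is a rough sketch of how pairwise Pearson coefficients can be compared against a cutoff using plain pandas. It is not ValidMind's implementation of HighPearsonCorrelation: the helper high_correlation_pairs, the synthetic columns, and the strict "greater than" comparison are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, max_threshold=0.3):
    """List column pairs whose absolute Pearson coefficient exceeds the threshold."""
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    rows = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            coefficient = corr.loc[left, right]
            if abs(coefficient) > max_threshold:
                rows.append((left, right, round(float(coefficient), 4)))
    return rows


# Synthetic data: 'b' is built from 'a', while 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})
print(high_correlation_pairs(toy))
```

Only the (a, b) pair should be flagged, since c is generated independently of the other two columns.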

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
2257 801 5 0.00 2 1 0 66256.27 0 False False True
888 613 3 0.00 1 1 1 41724.72 0 False True True
1249 632 5 97854.37 2 1 0 93536.38 0 False True False
197 659 6 117411.60 1 1 1 45071.09 1 True False True
2591 447 2 0.00 2 1 0 33879.26 1 False False True
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)
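The split above does not fix a random seed, so the exact rows in each set, and therefore the metrics reported later, can differ from run to run. If reproducibility matters for your validation, one option is sketched below on toy data. The random_state and stratify arguments are standard scikit-learn parameters, but using them here is a suggestion and not something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 50 rows per class
toy_df = pd.DataFrame({
    "feature": range(100),
    "Exited": [0, 1] * 50,
})

# random_state fixes the shuffle; stratify keeps the class ratio in both splits
toy_train_df, toy_test_df = train_test_split(
    toy_df,
    test_size=0.20,
    random_state=42,
    stratify=toy_df["Exited"],
)

print(len(toy_train_df), len(toy_test_df))
print(toy_test_df["Exited"].value_counts().to_dict())
```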

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion submitted by the development team in the format of a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
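The InconsistentVersionWarning in the output above indicates that the champion was pickled with scikit-learn 1.3.2 but loaded with 1.8.0. As a validator, you may want to capture such warnings so they can be reviewed or recorded. The following sketch uses only the standard library: the helper load_with_warnings is hypothetical, and a plain dictionary stands in for the model file.

```python
import io
import pickle
import warnings


def load_with_warnings(file_obj):
    """Unpickle an object and return it together with any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # Record every warning, even repeats
        obj = pickle.load(file_obj)
    return obj, [f"{w.category.__name__}: {w.message}" for w in caught]


# Stand-in payload; in the notebook this would be the opened .pkl file
buffer = io.BytesIO(pickle.dumps({"coef": [0.1, -0.2]}))
loaded, messages = load_with_warnings(buffer)
print(loaded, messages)
```

As with any pickle file, only load models that come from a source you trust, because unpickling can execute arbitrary code.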

Training a potential challenger model

We're curious how an alternate model compares to our champion, so let's train a challenger as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, risk is not calculated in isolation from a single factor, but rather in consideration with trade-offs in predictive performance, ease of interpretability, and overall alignment with business objectives.

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models can often achieve higher accuracy because they capture complex, non-linear relationships, but the trade-off is that their predictions are less transparent.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)
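To illustrate what "ensemble" means in practice, the sketch below fits a small forest on synthetic data and checks that the forest's probability estimate matches the average of its individual trees' estimates. The data comes from make_classification and is unrelated to the churn dataset. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the churn dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X, y)

# Each fitted tree is available in `estimators_`
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# The forest's probability estimate is the mean over its trees
averaged = tree_probas.mean(axis=0)
print(len(forest.estimators_), np.allclose(averaged, forest.predict_proba(X)))
```

Because each prediction is an average over many trees, no single decision path explains the outcome, which is one reason random forests are harder to interpret than a logistic regression.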

Initialize the ValidMind models

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models. These objects can then be passed to other functions for analysis and tests on the data.

  • Despite the naming convention, ValidMind model objects can be any type of record you want to test, document, validate, or monitor with the ValidMind Library.
  • From classical statistical and machine learning models, to generative and agentic AI systems and more, the ValidMind model object provides a consistent wrapper around your record so it can be passed as a unified input to any ValidMind test or test suite, with results sent directly to the ValidMind Platform.

Initialize your model objects with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model's predictions, and the binary predictions obtained after a cutoff threshold is applied to those probabilities.

  • The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
  • This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-05-26 22:12:03,716 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-05-26 22:12:03,718 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-05-26 22:12:03,718 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-05-26 22:12:03,720 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-05-26 22:12:03,722 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-05-26 22:12:03,723 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-05-26 22:12:03,724 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-05-26 22:12:03,725 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-05-26 22:12:03,727 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-05-26 22:12:03,749 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-05-26 22:12:03,750 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-05-26 22:12:03,772 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-05-26 22:12:03,774 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-05-26 22:12:03,785 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-05-26 22:12:03,786 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-05-26 22:12:03,797 - INFO(validmind.vm_models.dataset.utils): Done running predict()
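The log output shows both predict_proba() and predict() being run for each model and dataset. For a binary scikit-learn logistic regression, the two are related by a 0.5 cutoff on the positive-class probability, as the sketch below demonstrates on synthetic data. It only illustrates scikit-learn's behavior and makes no claim about how ValidMind stores the values internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the churn dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Column 1 holds the estimated probability of the positive class
positive_proba = model.predict_proba(X)[:, 1]

# Applying a 0.5 cutoff reproduces the labels returned by predict()
labels_from_cutoff = (positive_proba > 0.5).astype(int)
print(np.array_equal(labels_from_cutoff, model.predict(X)))
```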

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.CalibrationCurve Calibration Curve Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... True False ['model', 'dataset'] {'n_bins': {'type': 'int', 'default': 10}} ['sklearn', 'model_performance', 'classification'] ['classification']
validmind.model_validation.sklearn.ClassifierPerformance Classifier Performance Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... False True ['dataset', 'model'] {'average': {'type': 'str', 'default': 'macro'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix Confusion Matrix Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... True False ['dataset', 'model'] {'threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning Hyper Parameters Tuning Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... False True ['model', 'dataset'] {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} ['sklearn', 'model_performance'] ['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy Minimum Accuracy Checks if the model's prediction accuracy meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.7}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score Minimum F1 Score Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore Minimum ROCAUC Score Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... False True ['dataset', 'model'] {'min_threshold': {'type': 'float', 'default': 0.5}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison Models Performance Comparison Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... False True ['dataset', 'models'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex Population Stability Index Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... True True ['datasets', 'model'] {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve Precision Recall Curve Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve ROC Curve Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... True False ['model', 'dataset'] {} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors Regression Errors Assesses the performance and error distribution of a regression model using various error metrics.... False True ['model', 'dataset'] {} ['sklearn', 'model_performance'] ['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation Training Test Degradation Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... False True ['datasets', 'model'] {'max_threshold': {'type': 'float', 'default': 0.1}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable GINI Table Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... False True ['dataset', 'model'] {} ['model_performance'] ['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift Calibration Curve Drift Evaluates changes in probability calibration between reference and monitoring datasets.... True True ['datasets', 'model'] {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift Class Discrimination Drift Compares classification discrimination metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift Classification Accuracy Drift Compares classification accuracy metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift Confusion Matrix Drift Compares confusion matrix metrics between reference and monitoring datasets.... False True ['datasets', 'model'] {'drift_pct_threshold': {'type': '_empty', 'default': 20}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] ['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift ROC Curve Drift Compares ROC curves between reference and monitoring datasets.... True False ['datasets', 'model'] {} ['sklearn', 'binary_classification', 'model_performance', 'visualization'] ['classification', 'text_classification']

We'll isolate the specific tests we want to run in mpt:

  • model_validation.sklearn.ClassifierPerformance
  • model_validation.sklearn.ConfusionMatrix
  • model_validation.sklearn.MinimumAccuracy
  • model_validation.sklearn.MinimumF1Score
  • model_validation.sklearn.ROCCurve

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]
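Since the result_id is simply appended after a : separator, the tagged list can also be generated from the untagged test IDs. The sketch below shows one way to do that with plain string handling. The helpers tag_tests and split_tagged are hypothetical, and the rf_challenger tag is just an example name.

```python
TEST_IDS = [
    "validmind.model_validation.sklearn.ClassifierPerformance",
    "validmind.model_validation.sklearn.ConfusionMatrix",
    "validmind.model_validation.sklearn.MinimumAccuracy",
    "validmind.model_validation.sklearn.MinimumF1Score",
    "validmind.model_validation.sklearn.ROCCurve",
]


def tag_tests(test_ids, result_id):
    """Append ':<result_id>' to each test ID."""
    return [f"{test_id}:{result_id}" for test_id in test_ids]


def split_tagged(tagged_id):
    """Split a tagged ID back into (test_id, result_id)."""
    test_id, _, result_id = tagged_id.partition(":")
    return test_id, result_id


champion_tests = tag_tests(TEST_IDS, "logreg_champion")
challenger_tests = tag_tests(TEST_IDS, "rf_challenger")  # Example tag only
print(champion_tests[0])
print(split_tagged(challenger_tests[0]))
```

Generating the list this way keeps the champion and challenger batches in sync if you later add or remove a test.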

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

  • The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
  • The test set also acts as protection against selection bias and model tweaking, providing a final, less biased checkpoint.
for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

The Classifier Performance test evaluates binary classification performance using precision, recall, F1-score, accuracy, and ROC AUC. The reported results present class-level precision, recall, and F1 values for classes 0 and 1, along with weighted and macro averages across classes. A separate summary table reports overall accuracy of 0.6507 and ROC AUC of 0.6971. Together, these outputs provide both threshold-based classification metrics and the model’s ranking performance across the two classes.

Key insights:

  • Class-level performance is balanced: Precision, recall, and F1 are identical within each class, with class 0 at 0.6367 and class 1 at 0.6637. The difference between classes is limited to 0.0270 across these metrics.
  • Aggregate metrics are consistent: The weighted average precision, recall, and F1 are all 0.6507, while the macro averages are all 0.6502. The close alignment between weighted and macro averages indicates limited dispersion in performance across the two classes.
  • Overall accuracy matches weighted scores: Accuracy is reported at 0.6507, matching the weighted average precision, recall, and F1. This consistency shows that the aggregate threshold-based metrics are numerically aligned in the reported evaluation.
  • ROC AUC exceeds classification metrics: ROC AUC is 0.6971, which is higher than the reported accuracy and average F1 values near 0.65. This indicates stronger class-ranking performance than the threshold-based summary metrics alone reflect.

The results show a broadly even level of classification performance across both classes, with class 1 modestly outperforming class 0 on precision, recall, and F1. Aggregate measures are tightly clustered around 0.65, and the similarity between macro and weighted averages indicates that overall performance is not being driven by a large disparity between classes. The ROC AUC of 0.6971 is higher than the threshold-based aggregate metrics, indicating comparatively stronger discrimination than the single-threshold classification summaries alone.

Tables

Precision, Recall, and F1

Class Precision Recall F1
0 0.6367 0.6367 0.6367
1 0.6637 0.6637 0.6637
Weighted Average 0.6507 0.6507 0.6507
Macro Average 0.6502 0.6502 0.6502

Accuracy and ROC AUC

Metric Value
Accuracy 0.6507
ROC AUC 0.6971
2026-05-26 22:12:14,882 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

The Confusion Matrix test evaluates the classification model’s predictive performance by comparing predicted class labels with observed class labels and organizing the results into true positives, true negatives, false positives, and false negatives. The confusion matrix for logreg_champion shows 223 true positives, 198 true negatives, 113 false positives, and 113 false negatives. The figure presents these outcomes across the two classes, allowing direct comparison between correct classifications and each error type.

Key insights:

  • Correct predictions exceed errors: The model records 421 correct classifications in total (223 true positives and 198 true negatives) versus 226 misclassifications (113 false positives and 113 false negatives), indicating that correct assignments are more frequent than errors.

  • Error counts are balanced: False positives and false negatives are both 113. This shows that the two error types occur at the same observed frequency in this test sample.

  • Positive class captures more correct cases: True positives total 223, which is higher than the 198 true negatives. Within the observed results, the model correctly identifies more positive-class cases than negative-class cases.

  • Observed class totals differ: The matrix implies 336 observed positive cases (223 true positives + 113 false negatives) and 311 observed negative cases (198 true negatives + 113 false positives). This indicates a slightly larger count of positive cases than negative cases in the evaluated sample.

The confusion matrix shows that the model produces more correct classifications than incorrect ones, with correct predictions present in both classes. Misclassification is evenly split between false positives and false negatives, while correct positive classifications exceed correct negative classifications. The evaluated sample also contains slightly more observed positive cases than negative cases, which provides context for the distribution of outcomes shown in the matrix.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:d2b9
2026-05-26 22:12:25,139 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document

❌ Minimum Accuracy Logreg Champion

The Minimum Accuracy test evaluates whether the model’s prediction accuracy meets or exceeds a predefined threshold. The result table reports an accuracy score of 0.6507 against a threshold of 0.7000, along with the corresponding pass/fail outcome. This output provides the observed accuracy level used for the threshold comparison and the resulting test determination.

Key insights:

  • Accuracy is below threshold: The observed accuracy score is 0.6507, which is lower than the configured minimum threshold of 0.7000.
  • Test result is fail: The threshold comparison resulted in a failing outcome, as indicated directly in the result table.
  • Gap to threshold is 0.0493: The measured shortfall relative to the minimum threshold is 0.0493 based on the difference between 0.7000 and 0.6507.

The result shows that the model’s observed classification accuracy did not meet the minimum level defined for this test. The recorded accuracy of 0.6507 falls 0.0493 below the threshold of 0.7000, and the test outcome is therefore documented as Fail.

Tables

Score Threshold Pass/Fail
0.6507 0.7 Fail
2026-05-26 22:12:31,499 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

The MinimumF1Score test evaluates whether the model’s F1 score on the validation dataset meets a predefined minimum threshold. The result table reports a validation F1 score of 0.6637 alongside a threshold of 0.5 and a pass/fail outcome. These values show the measured score and the benchmark used for assessment in this test.

Key insights:

  • Threshold was exceeded: The validation F1 score is 0.6637 compared with a minimum threshold of 0.5, placing the observed result above the required cutoff.
  • Test outcome is pass: The reported pass/fail status is "Pass," consistent with the observed relationship between the score and threshold.
  • Positive margin over minimum: The observed F1 score exceeds the threshold by 0.1637, indicating performance above the configured minimum level for this metric.

The result shows that the model achieved an F1 score of 0.6637 on the validation dataset against a minimum threshold of 0.5. The score is above the defined cutoff by 0.1637, and the recorded outcome is a pass. Collectively, the test output indicates that the model satisfied the minimum performance criterion defined for this F1 score assessment.

Tables

Score Threshold Pass/Fail
0.6637 0.5 Pass
2026-05-26 22:12:36,981 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

The ROCCurve test evaluates binary classification performance by plotting the Receiver Operating Characteristic curve and calculating the Area Under the Curve (AUC). The result shows the ROC curve for log_model_champion on test_dataset_final, with the true positive rate plotted against the false positive rate across classification thresholds. The chart also includes a random-classification reference line with AUC = 0.5, and the model’s reported AUC is 0.70.

Key insights:

  • AUC indicates measurable discrimination: The plotted ROC curve reports an AUC of 0.70, compared with the random benchmark of 0.5 shown on the chart.
  • ROC curve remains above random baseline: Across most of the false positive rate range, the ROC curve lies above the diagonal reference line, indicating stronger ranking performance than random classification.
  • Performance improves progressively with threshold relaxation: The true positive rate increases steadily as the false positive rate rises, with the curve approaching a true positive rate near 1.0 at high false positive rates.

The ROC results show that log_model_champion demonstrates discrimination above the random benchmark on the test dataset, as reflected by the AUC of 0.70 and the ROC curve’s position above the diagonal baseline. The curve exhibits a steady gain in true positive rate as the allowable false positive rate increases, indicating consistent separation between the two classes across thresholds.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:5b4c
2026-05-26 22:12:46,059 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
Note that the output indicates that a test-driven block doesn't currently exist in your documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As the output above shows, our champion doesn't pass the MinimumAccuracy test at the default threshold of the out-of-the-box test, so let's log an artifact (finding) in the ValidMind Platform (Learn more: Add and manage artifacts):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Validation under Documents.

  3. Click on 2.2.2. Model Performance to expand that section.

  4. Under the Model Performance Metrics guideline, click to expand the Artifacts panel.

  5. Click Link Artifact and select Validation Issue as the type of artifact.

  6. Click + Add Validation Issue and enter in the details for your validation issue, for example:

    • TITLE — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
    • RISK AREA — Model Performance
    • DOCUMENTATION SECTION — 3.2. Model Evaluation
    • DESCRIPTION — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6507, which falls below the required minimum. As a result, the test produced a Fail outcome.
  7. Click Add Validation Issue to submit the validation issue.

  8. Select the validation issue you just added to link to your validation report.

  9. Click Update Linked Artifacts to insert your validation issue.

  10. Confirm that the validation issue you inserted now appears in section 2.2.2. Model Performance of the report.

  11. Click on the validation issue to expand it. There you can adjust details such as severity, owner, due date, and status, and include proposed remediation plans or supporting documentation as attachments.
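
To keep the figures in a validation issue consistent with what was actually logged, the description can be filled in from the test output rather than typed by hand. The following is a minimal sketch in plain Python. The helper name describe_minimum_accuracy is hypothetical (it is not part of the ValidMind Library), and the score and threshold are copied from the Minimum Accuracy result table above.

```python
def describe_minimum_accuracy(model_name, score, threshold):
    """Build a short validation issue description (hypothetical helper)."""
    # The test description above says accuracy must meet or exceed the threshold
    outcome = "Pass" if score >= threshold else "Fail"
    relation = "meets or exceeds" if outcome == "Pass" else "falls below"
    gap = abs(threshold - score)
    return (
        f"The {model_name} achieved an accuracy score of {score:.4f}, which "
        f"{relation} the minimum threshold of {threshold:.1f} "
        f"(difference of {gap:.4f}). The test outcome is {outcome}."
    )


# Values copied from the Minimum Accuracy result table for the champion
description = describe_minimum_accuracy(
    "logistic regression champion model", score=0.6507, threshold=0.7
)
print(description)
```

The printed text can then be pasted into the DESCRIPTION field, so the reported score always matches the run you are documenting.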

Evaluate performance of challenger model

We've now run tests on our champion similar to those run by the development team, with the aim of verifying their results.

Next, let's see how our challengers compare. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]
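
Each entry above is a test ID followed by a colon and a custom result_id. As a quick illustration using only standard Python string handling (this is not a ValidMind API), the two parts can be separated like this:

```python
test_with_result_id = "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"

# Split on the last colon: the left part is the test ID,
# the right part is the custom result_id appended to it
test_id, _, result_id = test_with_result_id.rpartition(":")

print(test_id)
print(result_id)
```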

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    # Run each test once per model against the same test dataset
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()
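
Conceptually, input_grid describes a grid of inputs, and the test runs once for each combination. The sketch below mimics that expansion with itertools.product, using plain strings as stand-ins for the ValidMind dataset and model objects. It is an assumption made for illustration, not the library's implementation.

```python
from itertools import product

# Plain strings stand in for vm_test_ds, vm_log_model and vm_rf_model
input_grid = {
    "dataset": ["vm_test_ds"],
    "model": ["vm_log_model", "vm_rf_model"],
}

# Build one dictionary of inputs per combination of grid values
keys = list(input_grid)
combinations = [dict(zip(keys, values)) for values in product(*input_grid.values())]

for combo in combinations:
    print(combo)
```

With one dataset and two models, the grid yields two combinations, which matches the two rows per test in the comparison tables.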

Classifier Performance Champion Vs Challenger

The Classifier Performance test evaluates classification model performance using precision, recall, F1-score, accuracy, and ROC AUC. The results compare log_model_champion and rf_model across class-level, macro-average, weighted-average, and overall metrics. log_model_champion reports class-level precision, recall, and F1 values of 0.6367 for class 0 and 0.6637 for class 1, with weighted-average and accuracy values of 0.6507 and ROC AUC of 0.6971. rf_model reports class 0 precision/recall/F1 of 0.6716/0.7363/0.7025 and class 1 precision/recall/F1 of 0.7320/0.6667/0.6978, with weighted-average F1 of 0.7000, accuracy of 0.7002, and ROC AUC of 0.7839.

Key insights:

  • Random forest outperforms champion overall: rf_model exceeds log_model_champion on accuracy (0.7002 vs. 0.6507), ROC AUC (0.7839 vs. 0.6971), weighted-average precision (0.7030 vs. 0.6507), weighted-average recall (0.7002 vs. 0.6507), and weighted-average F1 (0.7000 vs. 0.6507).

  • Champion shows fully balanced class metrics: For log_model_champion, precision, recall, and F1 are identical within each class and closely aligned across classes, ranging from 0.6367 to 0.6637. Its macro-average and weighted-average metrics are also nearly identical at approximately 0.65.

  • Random forest exhibits class trade-offs: In rf_model, class 0 recall is higher at 0.7363 while class 1 recall is lower at 0.6667. Precision moves in the opposite direction, with class 1 precision at 0.7320 versus 0.6716 for class 0, indicating uneven error distribution across classes.

  • Average metrics remain stable for both models: The gap between macro-average and weighted-average metrics is very small for both models. For log_model_champion, macro-average F1 is 0.6502 and weighted-average F1 is 0.6507; for rf_model, macro-average F1 is 0.7001 and weighted-average F1 is 0.7000.

The comparison shows that rf_model delivers stronger aggregate classification performance than log_model_champion across all reported overall metrics, with the largest difference appearing in ROC AUC. At the same time, log_model_champion presents more uniform class-level performance, while rf_model shows stronger overall results alongside a clearer trade-off between class-specific precision and recall. The close alignment between macro and weighted averages for both models indicates limited divergence between these aggregate views in the reported results.

Tables

model Class Precision Recall F1
log_model_champion 0 0.6367 0.6367 0.6367
log_model_champion 1 0.6637 0.6637 0.6637
log_model_champion Weighted Average 0.6507 0.6507 0.6507
log_model_champion Macro Average 0.6502 0.6502 0.6502
rf_model 0 0.6716 0.7363 0.7025
rf_model 1 0.7320 0.6667 0.6978
rf_model Weighted Average 0.7030 0.7002 0.7000
rf_model Macro Average 0.7018 0.7015 0.7001
model Metric Value
log_model_champion Accuracy 0.6507
log_model_champion ROC AUC 0.6971
rf_model Accuracy 0.7002
rf_model ROC AUC 0.7839
2026-05-26 22:12:56,868 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

The Confusion Matrix test evaluates classification performance by comparing predicted labels with true labels and displaying the resulting counts of true positives, true negatives, false positives, and false negatives. The results show confusion matrices for two models: log_model_champion and rf_model. For log_model_champion, the matrix contains 223 true positives, 198 true negatives, 113 false positives, and 113 false negatives. For rf_model, the matrix contains 224 true positives, 229 true negatives, 82 false positives, and 112 false negatives.

Key insights:

  • Random forest reduces false positives: rf_model records 82 false positives compared with 113 for log_model_champion, a reduction of 31 cases.
  • Random forest improves true negatives: True negatives increase from 198 in log_model_champion to 229 in rf_model, indicating stronger identification of class 0 observations.
  • Positive-class detection is nearly unchanged: True positives differ by 1 case between models (223 vs. 224), and false negatives differ by 1 case (113 vs. 112), indicating very similar classification of class 1 observations.
  • Overall correct classifications are higher for rf_model: Summing diagonal entries, rf_model produces 453 correct classifications versus 421 for log_model_champion.

Across the two confusion matrices, rf_model shows improved classification performance relative to log_model_champion, driven primarily by stronger classification of negative cases. The largest differences appear in the reduction of false positives and the corresponding increase in true negatives, while classification outcomes for positive cases remain effectively unchanged between the two models. Overall, the observed improvement is concentrated in class 0 predictions rather than class 1 predictions.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:e338
ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:7e5a
2026-05-26 22:13:09,704 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document
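
As a cross-check, the accuracy and class 1 precision, recall, and F1 values in the Classifier Performance tables can be recomputed from the confusion matrix counts reported above. This is a minimal sketch in plain Python using the standard definitions of these metrics. The function name class_metrics is hypothetical.

```python
def class_metrics(tp, tn, fp, fn):
    """Return accuracy plus precision, recall and F1 for the positive class (class 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts copied from the two confusion matrices reported above
champion = class_metrics(tp=223, tn=198, fp=113, fn=113)
challenger = class_metrics(tp=224, tn=229, fp=82, fn=112)

for name, metrics in [("log_model_champion", champion), ("rf_model", challenger)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Rounded to four decimals, these values agree with the class 1 rows and the accuracy figures in the tables above.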

❌ Minimum Accuracy Champion Vs Challenger

The Minimum Accuracy test evaluates whether each model’s prediction accuracy meets or exceeds the specified threshold. The result table compares observed accuracy scores against a threshold of 0.7 for two models: log_model_champion and rf_model. Reported scores are 0.6507 and 0.7002, respectively, with corresponding pass/fail outcomes based on whether the threshold is met.

Key insights:

  • Only one model meets threshold: rf_model achieves an accuracy score of 0.7002 against the 0.7 threshold and is marked as Pass, while log_model_champion is marked as Fail.
  • Champion model falls below minimum accuracy: log_model_champion records an accuracy score of 0.6507, which is 0.0493 below the stated threshold.
  • Passing margin is minimal: rf_model exceeds the threshold by 0.0002, indicating that its passing result is effectively at the cutoff level.
  • Accuracy gap favors challenger: The difference in accuracy between rf_model and log_model_champion is 0.0495, with the higher score observed for rf_model.

The results show a split outcome across the two evaluated models under the same minimum accuracy criterion. rf_model satisfies the threshold with a score just above the cutoff, whereas log_model_champion remains below the required level. The observed ranking by accuracy places rf_model above log_model_champion, with a gap of 0.0495 between their reported scores.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6507 0.7 Fail
rf_model 0.7002 0.7 Pass
2026-05-26 22:13:18,522 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document

✅ Minimum F1 Score Champion Vs Challenger

The MinimumF1Score test evaluates whether each model’s validation-set F1 score meets a predefined minimum threshold. The results table reports the F1 score, threshold, and pass/fail outcome for the evaluated models. Two models are shown: log_model_champion with an F1 score of 0.6637 and rf_model with an F1 score of 0.6978, and both are assessed against the same threshold of 0.5. The table indicates a passing outcome for both models.

Key insights:

  • Both models exceed threshold: log_model_champion and rf_model both post validation F1 scores above the 0.5 minimum threshold, with recorded outcomes of Pass for each model.
  • Random forest has higher F1: rf_model achieves an F1 score of 0.6978 compared with 0.6637 for log_model_champion, a difference of 0.0341.
  • Consistent evaluation standard applied: Both models are evaluated against the same minimum threshold of 0.5, allowing direct comparison of their reported F1 performance.

The test results show that both evaluated models satisfy the configured minimum F1 score requirement on the validation dataset. Within this comparison, rf_model records the higher F1 score, while log_model_champion also remains above the threshold by a clear margin. The observed outcomes indicate that both models passed this test under the same scoring criterion.

Tables

model Score Threshold Pass/Fail
log_model_champion 0.6637 0.5 Pass
rf_model 0.6978 0.5 Pass
2026-05-26 22:13:24,122 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

The ROCCurve:champion_vs_challenger test evaluates binary classification performance by plotting the ROC curve and calculating AUC for each model on the test dataset. The results show ROC curves for log_model_champion and rf_model, each compared against the random-classification baseline. The chart reports an AUC of 0.70 for log_model_champion and 0.78 for rf_model, with both curves remaining above the diagonal reference line across the plotted range.

Key insights:

  • Random forest shows higher AUC: rf_model records an AUC of 0.78 versus 0.70 for log_model_champion, indicating stronger ranking performance on the test dataset.
  • Both models outperform random baseline: Each ROC curve lies above the 0.5 random benchmark, and both reported AUC values exceed 0.5.
  • Separation is consistent across thresholds: The rf_model ROC curve remains visibly above the log_model_champion curve through much of the false-positive-rate range, reflecting stronger true-positive capture for comparable false-positive levels.

The ROC results indicate that both evaluated models demonstrate discriminative ability on the test dataset, with performance above the random baseline. Among the two, rf_model exhibits stronger overall discrimination, as reflected by its higher AUC and higher ROC trajectory across most thresholds. The observed difference between AUC values is 0.08 in favor of rf_model.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:fad6
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:22e7
2026-05-26 22:13:36,330 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document
Based on the performance metrics, our challenger random forest classification model passes the MinimumAccuracy test where our champion did not.

In your validation report, use your validation issue's Proposed Remediation Plan to recommend investigating the use of our challenger, and support that recommendation by inserting the performance tests we logged with this notebook into the appropriate section.
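
When drafting the remediation plan, it can help to collect the headline figures in one place. The sketch below is one possible way to do that in plain Python. The metric values are copied from the result tables logged above, and the dictionary layout and the key f1_class_1 are illustrative choices rather than anything produced by the ValidMind Library.

```python
# Overall metrics copied from the result tables logged above
metrics = {
    "log_model_champion": {"accuracy": 0.6507, "f1_class_1": 0.6637, "roc_auc": 0.6971},
    "rf_model": {"accuracy": 0.7002, "f1_class_1": 0.6978, "roc_auc": 0.7839},
}

# Thresholds reported by the MinimumAccuracy and MinimumF1Score tests
thresholds = {"accuracy": 0.7, "f1_class_1": 0.5}

deltas = {}
for metric in ["accuracy", "f1_class_1", "roc_auc"]:
    champion_score = metrics["log_model_champion"][metric]
    challenger_score = metrics["rf_model"][metric]
    # Positive delta means the challenger scored higher than the champion
    deltas[metric] = round(challenger_score - champion_score, 4)
    line = f"{metric}: champion={champion_score:.4f} challenger={challenger_score:.4f} delta={deltas[metric]:+.4f}"
    if metric in thresholds:
        champion_ok = champion_score >= thresholds[metric]
        challenger_ok = challenger_score >= thresholds[metric]
        line += f" (threshold {thresholds[metric]}: champion pass={champion_ok}, challenger pass={challenger_ok})"
    print(line)
```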

Run diagnostic tests

Next, we want to compare how our champion and challenger perform under robustness and stability testing.

Use list_tests() to list all available diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")
ID Name Description Has Figure Has Table Required Inputs Params Tags Tasks
validmind.model_validation.sklearn.OverfitDiagnosis Overfit Diagnosis Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... True True ['model', 'datasets'] {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] ['classification', 'regression']
validmind.model_validation.sklearn.RobustnessDiagnosis Robustness Diagnosis Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... True True ['datasets', 'model'] {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} ['sklearn', 'model_diagnosis', 'visualization'] ['classification', 'regression']
validmind.model_validation.sklearn.WeakspotsDiagnosis Weakspots Diagnosis Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... True True ['datasets', 'model'] {'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}} ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] ['classification', 'text_classification']

Let's now use the model_validation.sklearn.OverfitDiagnosis test to assess the models for potential signs of overfitting and to identify any sub-segments where performance may be inconsistent.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations. The result is excellent performance on the training dataset but poor generalization to new, unseen data:

  • Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
  • The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        # Each entry in "datasets" is a [training, testing] pair
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

Overfit Diagnosis Champion Vs Challenger

The Overfit Diagnosis test evaluates whether training and test performance diverge materially within feature-level segments by comparing AUC across binned slices using a 0.04 threshold. The results are reported separately for log_model_champion and rf_model, with each row identifying a feature slice, the associated training and test record counts, and the resulting AUC gap. For log_model_champion, only a limited set of slices appear above the threshold, while for rf_model the reported gaps are broader across most monitored features and slices. The accompanying plots visualize these segment-level gaps relative to the cut-off and align with the tabulated results.

Key insights:

  • Broader gap pattern in rf_model: rf_model shows gaps above 0.04 across all reported monitored features, including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Geography_Germany, Geography_Spain, and Gender_Male. In these slices, training AUC is consistently 1.0, while test AUC ranges from 0.0 to 0.9139.

  • log_model_champion has fewer flagged slices: For log_model_champion, reported gaps above threshold are limited to specific slices in CreditScore, Tenure, Balance, NumOfProducts, and EstimatedSalary. No flagged slices are reported in HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, or Gender_Male.

  • Largest gaps occur in sparse slices: The largest observed gaps are concentrated in low-count segments. For rf_model, Balance (200718.472, 225808.281] has a gap of 1.0 with 16 training and 2 test records, and NumOfProducts (2.8, 3.1] has a gap of 1.0 with 151 training and 35 test records. For log_model_champion, NumOfProducts (2.8, 3.1] has the largest gap at 0.6218 with 151 training and 35 test records, and Balance (25089.809, 50179.618] shows a gap of 0.4487 with 19 training and 3 test records.

  • Balance shows the strongest localized divergence: Balance contains several of the largest reported gaps for both models. In log_model_champion, notable slices include (150538.854, 175628.663] with gap 0.0815 and (175628.663, 200718.472] with gap 0.3701; in rf_model, multiple balance bins exceed 0.20, reaching 0.4173, 0.6389, and 1.0 in higher-balance ranges.

  • CreditScore and Tenure are widely flagged in rf_model: For rf_model, every reported CreditScore slice from (450.0, 500.0] through (800.0, 850.0] exceeds the threshold, with gaps ranging from 0.12 to 0.2949. Every reported Tenure slice also exceeds the threshold, with gaps from 0.0861 to 0.2985.

  • EstimatedSalary differs by model: In log_model_champion, only two EstimatedSalary slices exceed the threshold: (119994.202, 139977.944] with gap 0.0541 and (159961.686, 179945.428] with gap 0.0506. In rf_model, all reported EstimatedSalary slices exceed the threshold, with gaps ranging from 0.1433 to 0.3138.

The test results show a clear contrast between the two models at the segment level. log_model_champion exhibits threshold exceedances in a relatively small number of localized slices, whereas rf_model shows widespread training-test AUC separation across nearly all reported features, with training AUC fixed at 1.0 in each flagged segment. The largest divergences are concentrated in Balance and NumOfProducts, particularly in slices with small test sample counts, while CreditScore, Tenure, and EstimatedSalary also show consistently elevated gaps for rf_model.

Tables

model Feature Slice Number of Training Records Number of Test Records Training AUC Test AUC Gap
log_model_champion CreditScore (500.0, 550.0] 270 62 0.6685 0.6205 0.0480
log_model_champion CreditScore (550.0, 600.0] 392 83 0.7006 0.6593 0.0413
log_model_champion Tenure (5.0, 6.0] 249 68 0.6899 0.6333 0.0565
log_model_champion Balance (25089.809, 50179.618] 19 3 0.4487 0.0000 0.4487
log_model_champion Balance (150538.854, 175628.663] 183 54 0.6271 0.5456 0.0815
log_model_champion Balance (175628.663, 200718.472] 59 9 0.5368 0.1667 0.3701
log_model_champion NumOfProducts (2.8, 3.1] 151 35 0.6218 0.0000 0.6218
log_model_champion EstimatedSalary (119994.202, 139977.944] 252 67 0.7091 0.6551 0.0541
log_model_champion EstimatedSalary (159961.686, 179945.428] 266 65 0.6773 0.6267 0.0506
rf_model CreditScore (450.0, 500.0] 106 33 1.0000 0.7877 0.2123
rf_model CreditScore (500.0, 550.0] 270 62 1.0000 0.7051 0.2949
rf_model CreditScore (550.0, 600.0] 392 83 1.0000 0.8070 0.1930
rf_model CreditScore (600.0, 650.0] 469 119 1.0000 0.7504 0.2496
rf_model CreditScore (650.0, 700.0] 501 127 1.0000 0.7077 0.2923
rf_model CreditScore (700.0, 750.0] 386 100 1.0000 0.8347 0.1653
rf_model CreditScore (750.0, 800.0] 237 67 1.0000 0.8165 0.1835
rf_model CreditScore (800.0, 850.0] 178 35 1.0000 0.8800 0.1200
rf_model Tenure (-0.01, 1.0] 395 89 1.0000 0.7015 0.2985
rf_model Tenure (1.0, 2.0] 267 63 1.0000 0.8133 0.1867
rf_model Tenure (2.0, 3.0] 268 69 1.0000 0.7894 0.2106
rf_model Tenure (3.0, 4.0] 270 63 1.0000 0.7958 0.2042
rf_model Tenure (4.0, 5.0] 267 66 1.0000 0.8006 0.1994
rf_model Tenure (5.0, 6.0] 249 68 1.0000 0.7627 0.2373
rf_model Tenure (6.0, 7.0] 236 68 1.0000 0.7967 0.2033
rf_model Tenure (7.0, 8.0] 255 71 1.0000 0.7069 0.2931
rf_model Tenure (8.0, 9.0] 253 60 1.0000 0.8383 0.1617
rf_model Tenure (9.0, 10.0] 125 30 1.0000 0.9139 0.0861
rf_model Balance (-250.898, 25089.809] 848 190 1.0000 0.8632 0.1368
rf_model Balance (25089.809, 50179.618] 19 3 1.0000 0.5000 0.5000
rf_model Balance (50179.618, 75269.427] 91 17 1.0000 0.6429 0.3571
rf_model Balance (75269.427, 100359.236] 298 76 1.0000 0.7651 0.2349
rf_model Balance (100359.236, 125449.045] 598 145 1.0000 0.7592 0.2408
rf_model Balance (125449.045, 150538.854] 471 151 1.0000 0.7363 0.2637
rf_model Balance (150538.854, 175628.663] 183 54 1.0000 0.5827 0.4173
rf_model Balance (175628.663, 200718.472] 59 9 1.0000 0.3611 0.6389
rf_model Balance (200718.472, 225808.281] 16 2 1.0000 0.0000 1.0000
rf_model NumOfProducts (0.997, 1.3] 1502 369 1.0000 0.6822 0.3178
rf_model NumOfProducts (1.9, 2.2] 898 233 1.0000 0.7143 0.2857
rf_model NumOfProducts (2.8, 3.1] 151 35 1.0000 0.0000 1.0000
rf_model HasCrCard (-0.001, 0.1] 763 192 1.0000 0.7229 0.2771
rf_model HasCrCard (0.9, 1.0] 1822 455 1.0000 0.8081 0.1919
rf_model IsActiveMember (-0.001, 0.1] 1356 341 1.0000 0.7583 0.2417
rf_model IsActiveMember (0.9, 1.0] 1229 306 1.0000 0.7725 0.2275
rf_model EstimatedSalary (-108.087, 20075.492] 265 65 1.0000 0.8036 0.1964
rf_model EstimatedSalary (20075.492, 40059.234] 260 50 1.0000 0.7866 0.2134
rf_model EstimatedSalary (40059.234, 60042.976] 262 68 1.0000 0.6964 0.3036
rf_model EstimatedSalary (60042.976, 80026.718] 271 58 1.0000 0.8316 0.1684
rf_model EstimatedSalary (80026.718, 100010.46] 255 76 1.0000 0.8302 0.1698
rf_model EstimatedSalary (100010.46, 119994.202] 272 65 1.0000 0.7452 0.2548
rf_model EstimatedSalary (119994.202, 139977.944] 252 67 1.0000 0.8320 0.1680
rf_model EstimatedSalary (139977.944, 159961.686] 249 67 1.0000 0.7818 0.2182
rf_model EstimatedSalary (159961.686, 179945.428] 266 65 1.0000 0.6862 0.3138
rf_model EstimatedSalary (179945.428, 199929.17] 233 66 1.0000 0.8567 0.1433
rf_model Geography_Germany (-0.001, 0.1] 1814 422 1.0000 0.7744 0.2256
rf_model Geography_Germany (0.9, 1.0] 771 225 1.0000 0.7510 0.2490
rf_model Geography_Spain (-0.001, 0.1] 1961 510 1.0000 0.7853 0.2147
rf_model Geography_Spain (0.9, 1.0] 624 137 1.0000 0.7757 0.2243
rf_model Gender_Male (-0.001, 0.1] 1258 333 1.0000 0.7704 0.2296
rf_model Gender_Male (0.9, 1.0] 1327 314 1.0000 0.7903 0.2097

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2175
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:7b75
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6e45
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:cdb7
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0e37
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:561d
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:aaee
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c615
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:971b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2cb4
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:b42c
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f2b3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:f6a5
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:9814
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:d3f8
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:94ce
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1c50
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:051b
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:14a3
ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:34ee
2026-05-26 22:14:01,653 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document
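
The Gap column in the table above is the training AUC minus the test AUC for each slice, compared against the 0.04 cut-off. The sketch below reproduces that arithmetic for three rows copied from the table. Whether the test uses a strict or non-strict comparison at exactly 0.04 is not shown in the output, so the strict comparison here is an assumption.

```python
CUT_OFF_THRESHOLD = 0.04  # default cut_off_threshold listed for OverfitDiagnosis

# (model, feature, slice, training AUC, test AUC) copied from the table above
rows = [
    ("log_model_champion", "CreditScore", "(500.0, 550.0]", 0.6685, 0.6205),
    ("rf_model", "Tenure", "(9.0, 10.0]", 1.0000, 0.9139),
    ("rf_model", "Balance", "(200718.472, 225808.281]", 1.0000, 0.0000),
]

gaps = []
for model, feature, slice_label, train_auc, test_auc in rows:
    gap = round(train_auc - test_auc, 4)
    flagged = gap > CUT_OFF_THRESHOLD  # assumption: strict comparison
    gaps.append((model, feature, gap, flagged))
    print(f"{model} {feature} {slice_label}: gap={gap:.4f} flagged={flagged}")
```

Note that the last row is based on only 2 test records, so its gap of 1.0 says more about the size of the slice than about the model.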

Let's also conduct robustness and stability testing of the two models with the model_validation.sklearn.RobustnessDiagnosis test.

Robustness refers to a model's ability to maintain consistent performance when its inputs are noisy or perturbed, and stability refers to a model's ability to produce consistent outputs over time and across different data subsets.

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

The Robustness Diagnosis test evaluates model resilience by measuring AUC decay after adding Gaussian noise to numeric input features at increasing perturbation sizes. The results report baseline and perturbed AUC values for both log_model_champion and rf_model on the training and test datasets, along with performance decay and pass/fail outcomes at each noise level. For log_model_champion, AUC values remain in a narrow range across perturbation sizes on both datasets. For rf_model, the results show a larger decline in training AUC as perturbation size increases and a sharper decline in test AUC at the highest perturbation level.

Key insights:

  • Logistic model shows limited AUC decay: log_model_champion starts at AUC 0.6828 on train and 0.6971 on test at baseline, and remains between 0.6671 and 0.6817 on train and between 0.6768 and 0.6967 on test across perturbation sizes 0.1 to 0.5. Reported performance decay stays within 0.0158 on train and 0.0203 on test, and all runs pass.

  • Random forest degrades materially on train data: rf_model baseline train AUC is 1.0000 and declines to 0.9848, 0.9390, 0.8956, 0.8484, and 0.8016 as perturbation size increases from 0.1 to 0.5. Corresponding train performance decay rises from 0.0152 to 0.1984, with failed results beginning at perturbation size 0.2 and continuing through 0.5.

  • Random forest test performance drops sharply at highest noise: On the test dataset, rf_model begins at AUC 0.7839 and remains between 0.7560 and 0.7764 through perturbation sizes 0.1 to 0.4, before falling to 0.6713 at perturbation size 0.5. Performance decay reaches 0.1126 at 0.5, and this is the only failed test result for rf_model on the test dataset.

  • Baseline train-test separation differs by model: At baseline, log_model_champion shows similar train and test AUC values (0.6828 vs. 0.6971), while rf_model shows a wider gap between train and test performance (1.0000 vs. 0.7839). This separation remains visible across perturbation levels, particularly on the random forest training results.

The robustness results indicate distinct noise sensitivity profiles for the two models. log_model_champion exhibits small AUC changes across all evaluated perturbation sizes and passes all train and test checks, while rf_model shows substantial degradation on the training dataset beginning at moderate perturbation levels and a pronounced test-set decline at the highest noise level. Across the reported results, the logistic model remains comparatively stable under Gaussian perturbation, whereas the random forest model displays greater sensitivity as noise intensity increases.

Tables

| model | Perturbation Size | Dataset | Row Count | AUC | Performance Decay | Passed |
|---|---|---|---|---|---|---|
| log_model_champion | Baseline (0.0) | train_dataset_final | 2585 | 0.6828 | 0.0000 | True |
| log_model_champion | Baseline (0.0) | test_dataset_final | 647 | 0.6971 | 0.0000 | True |
| log_model_champion | 0.1 | train_dataset_final | 2585 | 0.6817 | 0.0012 | True |
| log_model_champion | 0.1 | test_dataset_final | 647 | 0.6912 | 0.0059 | True |
| log_model_champion | 0.2 | train_dataset_final | 2585 | 0.6810 | 0.0018 | True |
| log_model_champion | 0.2 | test_dataset_final | 647 | 0.6910 | 0.0062 | True |
| log_model_champion | 0.3 | train_dataset_final | 2585 | 0.6752 | 0.0076 | True |
| log_model_champion | 0.3 | test_dataset_final | 647 | 0.6869 | 0.0103 | True |
| log_model_champion | 0.4 | train_dataset_final | 2585 | 0.6671 | 0.0158 | True |
| log_model_champion | 0.4 | test_dataset_final | 647 | 0.6768 | 0.0203 | True |
| log_model_champion | 0.5 | train_dataset_final | 2585 | 0.6673 | 0.0155 | True |
| log_model_champion | 0.5 | test_dataset_final | 647 | 0.6967 | 0.0004 | True |
| rf_model | Baseline (0.0) | train_dataset_final | 2585 | 1.0000 | 0.0000 | True |
| rf_model | Baseline (0.0) | test_dataset_final | 647 | 0.7839 | 0.0000 | True |
| rf_model | 0.1 | train_dataset_final | 2585 | 0.9848 | 0.0152 | True |
| rf_model | 0.1 | test_dataset_final | 647 | 0.7756 | 0.0083 | True |
| rf_model | 0.2 | train_dataset_final | 2585 | 0.9390 | 0.0610 | False |
| rf_model | 0.2 | test_dataset_final | 647 | 0.7681 | 0.0158 | True |
| rf_model | 0.3 | train_dataset_final | 2585 | 0.8956 | 0.1044 | False |
| rf_model | 0.3 | test_dataset_final | 647 | 0.7764 | 0.0075 | True |
| rf_model | 0.4 | train_dataset_final | 2585 | 0.8484 | 0.1516 | False |
| rf_model | 0.4 | test_dataset_final | 647 | 0.7560 | 0.0279 | True |
| rf_model | 0.5 | train_dataset_final | 2585 | 0.8016 | 0.1984 | False |
| rf_model | 0.5 | test_dataset_final | 647 | 0.6713 | 0.1126 | False |
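
The Performance Decay and Passed columns can be reproduced by hand from the AUC column. The sketch below does this for the rf_model rows, with decay taken as baseline AUC minus perturbed AUC. The 0.05 cut-off is an assumption: it is consistent with the table (a decay of 0.0279 passes and 0.0610 fails), but the threshold itself is not shown in this output.

```python
# AUC values copied from the rf_model rows of the table above,
# keyed by perturbation size (0.0 is the baseline)
rf_auc = {
    "train_dataset_final": {0.0: 1.0000, 0.1: 0.9848, 0.2: 0.9390,
                            0.3: 0.8956, 0.4: 0.8484, 0.5: 0.8016},
    "test_dataset_final": {0.0: 0.7839, 0.1: 0.7756, 0.2: 0.7681,
                           0.3: 0.7764, 0.4: 0.7560, 0.5: 0.6713},
}

# Assumed threshold: consistent with the table, not confirmed by it
DECAY_THRESHOLD = 0.05


def decay_report(auc_by_size, threshold=DECAY_THRESHOLD):
    """Return {perturbation size: (decay, passed)} relative to the baseline AUC."""
    baseline = auc_by_size[0.0]
    report = {}
    for size, auc_value in sorted(auc_by_size.items()):
        decay = round(baseline - auc_value, 4)
        report[size] = (decay, decay < threshold)
    return report


for dataset, auc_by_size in rf_auc.items():
    for size, (decay, passed) in decay_report(auc_by_size).items():
        print(f"{dataset} size={size}: decay={decay:.4f} passed={passed}")
```

Recomputing a few rows like this is a quick sanity check that the pass/fail outcomes follow from the reported AUC values and not from some other criterion.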

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:a971
ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:41bb
2026-05-26 22:14:22,747 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document
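
The single run_test() call above produced results for both models. One way to picture this is to treat input_grid as a Cartesian product over its values, so that each combination becomes one test run. The sketch below illustrates that reading with plain strings in place of the `vm_*` objects. It is an assumption about the behavior, consistent with the output above, and not a reproduction of the library's internals.

```python
from itertools import product

# Placeholder strings stand in for the vm_* objects used in the notebook
input_grid = {
    "datasets": [["vm_train_ds", "vm_test_ds"]],
    "model": ["vm_log_model", "vm_rf_model"],
}

# One dict of inputs per combination of grid values
keys = list(input_grid)
combinations = [dict(zip(keys, values)) for values in product(*input_grid.values())]

for combo in combinations:
    print(combo)
```

Under this reading, the outer list for "datasets" holds a single entry (the train and test pair), so the grid yields two runs, one per model.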

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, as well as inspect the differences between our champion and challenger to see if a certain model offers more understandable or logical importance scores for features.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI
['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here, to provide a realistic, unseen sample that mimics future or production data, as the training dataset has already influenced our model during learning:

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features Champion Vs Challenger

The FeaturesAUC test evaluates the discriminatory power of each individual feature by computing its standalone AUC against the binary target. The result is presented as a ranked bar chart for test_dataset_final, showing feature-level AUC values across the input variables. Observed AUCs range from approximately 0.39 to 0.61, with features ordered from the highest to the lowest univariate discrimination.

Key insights:

  • Geography_Germany has the highest AUC: Geography_Germany is the top-ranked feature with an AUC slightly above 0.60, making it the strongest standalone discriminator among the features shown.

  • Balance is the second strongest feature: Balance follows closely behind, with an AUC just under 0.60, indicating comparatively strong univariate separation relative to the remaining variables.

  • Most features cluster near 0.45 to 0.50: HasCrCard, Tenure, EstimatedSalary, Geography_Spain, Gender_Male, CreditScore, and NumOfProducts are concentrated in a narrow band around the mid-0.40s to low-0.50s, indicating similar standalone discriminatory strength across this group.

  • IsActiveMember is the weakest feature shown: IsActiveMember has the lowest AUC, at just under 0.40, placing it at the bottom of the ranking in this test output.

The univariate AUC profile shows a clear ranking led by Geography_Germany and Balance, which exhibit the strongest standalone discrimination in test_dataset_final. Most other features form a middle tier with relatively similar AUC values near 0.45 to 0.50, while IsActiveMember is separated as the lowest-scoring feature. Overall, the result indicates uneven standalone discriminatory power across the feature set, with a small number of features contributing stronger individual separation than the remainder.

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:e0fe
ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:43a0
2026-05-26 22:14:41,852 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document
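
As a rough illustration of what a univariate AUC measures, the sketch below scores each feature by using its raw values directly as a ranking score against the binary target. The feature names and values are hypothetical toy data, and the hand-rolled `auc` helper only keeps the example dependency-free.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three features and a binary target
target = [1, 1, 1, 0, 0, 0]
data = {
    "feature_up": [0.9, 0.8, 0.7, 0.4, 0.3, 0.2],    # higher when target is 1
    "feature_down": [0.1, 0.7, 0.2, 0.8, 0.6, 0.9],  # mostly lower when target is 1
    "feature_flat": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # carries no information
}

# Use each feature's raw values as the score, then rank features by AUC
feature_auc = {name: auc(values, target) for name, values in data.items()}
ranked = sorted(feature_auc.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name}: AUC={value:.3f}, distance from 0.5={abs(value - 0.5):.3f}")
```

Note that an AUC below 0.5 indicates an inverse relationship with the target, not an absence of signal. When reading a ranked chart like the one above, it can therefore help to also look at each feature's distance from 0.5.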

Permutation Feature Importance Champion Vs Challenger

The Permutation Feature Importance test evaluates the significance of each input feature by measuring the change in model performance after that feature is randomly permuted. The results are shown separately for the champion logistic model and the challenger random forest model, with bar lengths indicating the relative effect of each feature on performance. Across both plots, feature importance is concentrated in a subset of variables, while several features have values near zero or below zero. The ranking and magnitude of importances differ materially between the two models.

Key insights:

  • Champion model is driven by geography and activity: In the logistic champion model, Geography_Germany has the largest permutation importance at approximately 0.05, followed closely by IsActiveMember at roughly 0.047. Gender_Male is the next most influential feature at about 0.026, with all remaining positive importances substantially smaller.

  • Challenger model is dominated by product count: In the random forest challenger, NumOfProducts is the most important feature by a wide margin at approximately 0.14. This is materially larger than the next tier of features, where Balance and Geography_Germany are both near 0.04.

  • Importance rankings differ across models: NumOfProducts has minimal importance in the champion model but is the dominant driver in the challenger. Conversely, IsActiveMember is one of the strongest features in the champion model but is notably smaller in the challenger at roughly 0.017.

  • Several features contribute little or negatively: Both models show features with near-zero or negative permutation importance. In the champion model, CreditScore and EstimatedSalary are negative, while Tenure is approximately neutral; in the challenger model, HasCrCard, Geography_Spain, and EstimatedSalary are negative, and Gender_Male and CreditScore are close to zero.

  • Champion importance is more evenly distributed: The champion model spreads importance across three main features (Geography_Germany, IsActiveMember, and Gender_Male) before dropping to smaller values. The challenger is more concentrated, with NumOfProducts substantially exceeding every other feature.

The permutation importance results indicate that the champion and challenger models rely on different feature structures. The champion logistic model is primarily influenced by Geography_Germany and IsActiveMember, with a secondary contribution from Gender_Male, whereas the challenger random forest is heavily concentrated in NumOfProducts with moderate contribution from Balance and Geography_Germany. In both models, multiple variables have limited or negative measured contribution under permutation, indicating that predictive signal is concentrated in a narrower subset of the available inputs.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:25fb
ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:4803
2026-05-26 22:15:06,871 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document
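
The permutation technique described above can be sketched in a few lines: shuffle one feature at a time, re-score the model, and record how much the score drops. The example below is a simplified, standard-library-only illustration that uses accuracy as the score. The `toy_predict` model and the synthetic data are hypothetical, and the test itself may use a different metric and number of repeats.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical model that relies on feature 0 and ignores feature 1."""
    return 1 if row[0] > 0 else 0


rng = random.Random(1)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
labels = [toy_predict(r) for r in rows]

importances = permutation_importance(toy_predict, rows, labels)
print(importances)
```

Because the score is re-measured on randomly shuffled data, a feature that carries little signal can come out slightly negative by chance, which is one plausible reading of the small negative bars in the plots above.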

SHAP Global Importance Champion Vs Challenger

The SHAP Global Importance test evaluates global feature importance by summarizing the magnitude and direction of SHAP contributions across model inputs. The results shown here compare a champion logistic model and a challenger random forest model using normalized feature-importance bars and SHAP summary-style distributions. For the champion model, the plots display a ranked importance ordering across ten features and the spread of each feature’s SHAP values around zero. For the challenger model, the figures shown are limited to CreditScore and Tenure, including both normalized SHAP value distributions and SHAP interaction value distributions.

Key insights:

  • Champion importance is highly concentrated: In the champion model, IsActiveMember has the largest normalized importance at approximately 100, followed by Geography_Germany near 86 and Gender_Male near 74. A second tier begins with Balance near 43, while the remaining features are materially smaller, with Geography_Spain the smallest at roughly 6.

  • Top champion features show directional separation: The champion summary plot shows clear left-right separation for several leading features. For IsActiveMember, high feature values are concentrated on the negative SHAP side while low values are concentrated on the positive side; Geography_Germany and Gender_Male show the opposite pattern, with high values appearing on the positive SHAP side and low values on the negative side.

  • Balance has the broadest champion spread: Balance exhibits the widest SHAP dispersion among the champion model’s continuous variables, extending from slightly negative values to strongly positive values, with most higher feature values appearing on the positive SHAP side. Other continuous features such as CreditScore, Tenure, and EstimatedSalary are more tightly clustered around zero by comparison.

  • Several champion features have limited marginal impact: EstimatedSalary, HasCrCard, and Geography_Spain have small normalized importance values and their SHAP points remain close to zero. This indicates comparatively low contribution magnitude in the champion model relative to the leading variables.

  • Challenger plots are narrow in scope: The challenger model figures shown include only CreditScore and Tenure. In both the normalized SHAP value view and the SHAP interaction view, the point clouds for these features are centered close to zero, with Tenure appearing more tightly concentrated than CreditScore.

  • Challenger interaction effects appear limited in displayed features: In the challenger SHAP interaction plot, the displayed interaction values for CreditScore and Tenure remain within a narrow range around zero. No broad interaction spread is visible in the provided figure for these two features.

The SHAP results show a strongly differentiated importance structure in the champion model, with IsActiveMember, Geography_Germany, and Gender_Male accounting for the largest observed contribution magnitudes and with clear directional separation visible for the top-ranked features. The remaining champion features contribute at progressively smaller levels, with several clustered near zero impact. For the challenger model, the provided SHAP outputs are restricted to CreditScore and Tenure, and within those displayed views both marginal and interaction contributions appear tightly centered around zero.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:3f2d
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:8d6e
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:c91a
ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:839c
2026-05-26 22:15:24,939 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document
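
For intuition about the normalized importance bars and the direction of SHAP values, consider the linear case. Under a feature-independence assumption, the SHAP value of feature i for a linear model is w_i * (x_i - mean(x_i)), expressed in the model's linear (log-odds) output for a logistic regression. The sketch below uses hypothetical weights and data, and assumes that the normalized importance is the mean absolute SHAP value rescaled so the largest feature equals 100. That rescaling is a guess consistent with the top feature sitting at approximately 100, not a confirmed description of the test.

```python
import statistics

# Hypothetical linear model weights and toy data (not taken from the notebook)
weights = {"a": -2.0, "b": 1.0, "c": 0.1}
rows = [
    {"a": 1.0, "b": 0.0, "c": 2.0},
    {"a": 0.0, "b": 1.0, "c": 4.0},
    {"a": 1.0, "b": 1.0, "c": 6.0},
    {"a": 0.0, "b": 0.0, "c": 8.0},
]

means = {name: statistics.fmean(r[name] for r in rows) for name in weights}


def linear_shap(row):
    """SHAP values for a linear model, assuming independent features."""
    return {name: w * (row[name] - means[name]) for name, w in weights.items()}


shap_rows = [linear_shap(r) for r in rows]

# Global importance: mean absolute SHAP value per feature
mean_abs = {name: statistics.fmean(abs(s[name]) for s in shap_rows) for name in weights}

# Assumed normalization: rescale so the largest feature equals 100
top = max(mean_abs.values())
normalized = {name: 100 * value / top for name, value in mean_abs.items()}

for name, value in sorted(normalized.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.1f}")
```

A negative weight, as for feature `a` here, means high feature values receive negative SHAP values and low feature values receive positive ones, which is the same kind of left-right separation described for the champion's summary plot.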

In summary

In this third notebook, you learned how to:

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial