ValidMind for validation 3 — Developing a potential challenger

Learn how to use ValidMind for your end-to-end validation process with our series of four introductory notebooks. In this third notebook, develop a potential challenger and then pass your challenger and its predictions to ValidMind.

A challenger is an alternate record (model) that attempts to outperform the champion, ensuring that the best performing fit-for-purpose record is always considered for deployment. Challengers also help avoid over-reliance on a single record, and allow testing of new features, algorithms, or data sources without disrupting the production lifecycle.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform: Validator Fundamentals

Prerequisites

In order to develop potential challengers with this notebook, you'll need to first have:

Registered a model within the ValidMind Platform and granted yourself access to the model as a validator
Installed the ValidMind Library in your local environment, allowing you to access all its features
Learned how to import and initialize datasets for use with ValidMind
Understood the basics of how to run and log tests with ValidMind
Run data quality tests on the datasets used to train the champion, and logged the results of those tests to ValidMind
Inserted your logged test results into your validation report

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you — as we performed the same actions in the previous notebook, 2 — Start the validation process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

On the left sidebar that appears for your model, select Getting Started and select Validation from the Document drop-down menu.
Click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="validation-report",
)

Note: you may need to restart the kernel to use updated packages.

2026-07-14 05:34:01,158 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the sample Bank Customer Churn Prediction dataset that was used to develop the champion, which we will preprocess independently:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()

Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

Preprocess the dataset

We’ll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
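The rebalancing step above is a form of random undersampling: the majority class is sampled down to the size of the minority class. As a rough, standard-library-only sketch of the same idea on toy data (the 20/80 class split, the seed, and all variable names here are illustrative assumptions, not values from the churn dataset):

```python
import random
from collections import Counter

# Toy labels standing in for the "Exited" column: 1 = exited, 0 = did not exit
rng = random.Random(42)
labels = [1] * 20 + [0] * 80
rows = list(enumerate(labels))  # (row_id, label) pairs

minority = [row for row in rows if row[1] == 1]
majority = [row for row in rows if row[1] == 0]

# Undersample the majority class down to the size of the minority class
majority_sample = rng.sample(majority, k=len(minority))

balanced = minority + majority_sample
rng.shuffle(balanced)  # Mix the classes so row order carries no signal

class_counts = Counter(label for _, label in balanced)
print(class_counts)
```

Undersampling discards majority-class rows, so it trades dataset size for class balance; whether that trade is acceptable depends on how much data is available.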

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you know, before we can run tests you’ll need to initialize a ValidMind dataset object with the init_dataset function:

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify highly correlated variable pairs that may indicate redundancy or multicollinearity. The results table lists the top 10 strongest correlations, showing each feature pair, its Pearson correlation coefficient, and a Pass/Fail outcome based on the absolute correlation threshold of 0.3. Reported coefficients range from -0.1892 to 0.3617, with one pair exceeding the threshold and the remaining pairs classified as Pass.

Key insights:

One pair exceeds threshold: The pair (Age, Exited) has a correlation coefficient of 0.3617, which is the only reported relationship that breaches the 0.3 threshold and is therefore marked Fail.
Remaining correlations are weak: All other listed feature pairs have absolute correlation values below 0.19, including (IsActiveMember, Exited) at -0.1892 and (Balance, NumOfProducts) at -0.1729, and are marked Pass.
Observed relationships are mostly small in magnitude: Outside the top-ranked pair, the reported coefficients are clustered near zero, including values such as 0.0569, 0.0564, -0.0348, -0.0322, and -0.0290.

The reported correlation structure is concentrated in a single above-threshold relationship between Age and Exited, while the remaining top correlations are all below the configured limit and relatively small in magnitude. Across the displayed results, the strongest non-failing correlations remain materially lower than the threshold, indicating that elevated linear association is limited within the reported top pairs.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Age, Exited)	0.3617	Fail
(IsActiveMember, Exited)	-0.1892	Pass
(Balance, NumOfProducts)	-0.1729	Pass
(Balance, Exited)	0.1578	Pass
(NumOfProducts, Exited)	-0.0664	Pass
(NumOfProducts, IsActiveMember)	0.0569	Pass
(Age, Balance)	0.0564	Pass
(Age, NumOfProducts)	-0.0348	Pass
(Balance, HasCrCard)	-0.0322	Pass
(Balance, IsActiveMember)	-0.0290	Pass
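To make the Pass/Fail column concrete, here is a minimal sketch of how a Pearson coefficient can be computed and compared against a threshold. The helpers `pearson` and `pass_fail` are hypothetical and are not part of the ValidMind Library; treating an absolute coefficient strictly greater than the threshold as a failure is an assumption that happens to agree with the results shown above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pass_fail(coefficient, max_threshold=0.3):
    """Assumed rule: fail when the absolute coefficient exceeds the threshold."""
    return "Fail" if abs(coefficient) > max_threshold else "Pass"


# Coefficients taken from the results table above
print(pass_fail(0.3617))   # (Age, Exited)
print(pass_fail(-0.1892))  # (IsActiveMember, Exited)
```

Because the comparison uses the absolute value, strongly negative correlations are flagged in the same way as strongly positive ones.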

# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df

	Columns	Coefficient	Pass/Fail
0	(Age, Exited)	0.3617	Fail
1	(IsActiveMember, Exited)	-0.1892	Pass
2	(Balance, NumOfProducts)	-0.1729	Pass
3	(Balance, Exited)	0.1578	Pass
4	(NumOfProducts, Exited)	-0.0664	Pass
5	(NumOfProducts, IsActiveMember)	0.0569	Pass
6	(Age, Balance)	0.0564	Pass
7	(Age, NumOfProducts)	-0.0348	Pass
8	(Balance, HasCrCard)	-0.0322	Pass
9	(Balance, IsActiveMember)	-0.0290	Pass

# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features

['(Age, Exited)']

# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features

['Age']
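Note that taking the first name in each pair works here because the target column happens to come second in `(Age, Exited)`. A slightly more defensive variant, sketched below under the assumption that pairs are always formatted as `'(A, B)'`, skips the target column explicitly so it can never end up in the drop list. The helper name `features_to_drop` is hypothetical.

```python
def features_to_drop(pairs, target_column):
    """Pick one non-target feature from each failed pair such as '(Age, Exited)'."""
    to_drop = []
    for pair in pairs:
        names = [name.strip() for name in pair.strip("()").split(",")]
        # Never drop the target column, regardless of its position in the pair
        candidates = [name for name in names if name != target_column]
        if candidates and candidates[0] not in to_drop:
            to_drop.append(candidates[0])
    return to_drop


print(features_to_drop(["(Age, Exited)"], target_column="Exited"))  # ['Age']
```

For a pair of two ordinary features, this sketch still keeps only the first one, mirroring the behavior of the cell above.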

We can then re-initialize the dataset with a different input_id and the highly correlated features removed and re-run the test for confirmation:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)

# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify potentially redundant or highly collinear variables. The results table reports the top feature pairs ranked by Pearson correlation coefficient, along with Pass/Fail status based on the configured absolute correlation threshold of 0.3. In this run, the reported coefficients range from -0.1892 to 0.1578, and all listed feature pairs are marked as Pass.

Key insights:

No correlations exceed threshold: All reported absolute correlation coefficients are below the 0.3 threshold used in the test. Every listed pair therefore receives a Pass result.
Largest observed correlation is modest: The strongest relationship in the reported output is between IsActiveMember and Exited at -0.1892. This is the largest absolute coefficient shown and remains below the configured threshold.
Top relationships are concentrated at low magnitudes: The next largest reported correlations are Balance with NumOfProducts at -0.1729 and Balance with Exited at 0.1578. All remaining listed coefficients are below 0.07 in absolute value.
Both positive and negative relationships appear: The reported pairs include negative coefficients such as Balance with NumOfProducts (-0.1729) and positive coefficients such as Balance with Exited (0.1578). The observed linear relationships are therefore mixed in direction but uniformly low in magnitude.

The reported correlation structure shows no feature pairs exceeding the configured Pearson correlation threshold. The strongest observed relationships are limited to modest absolute values, with most listed pairs exhibiting only weak linear association. Based on the reported output, the test does not identify high pairwise linear correlation among the features shown.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(IsActiveMember, Exited)	-0.1892	Pass
(Balance, NumOfProducts)	-0.1729	Pass
(Balance, Exited)	0.1578	Pass
(NumOfProducts, Exited)	-0.0664	Pass
(NumOfProducts, IsActiveMember)	0.0569	Pass
(Balance, HasCrCard)	-0.0322	Pass
(Balance, IsActiveMember)	-0.0290	Pass
(Tenure, Exited)	-0.0268	Pass
(CreditScore, Tenure)	0.0219	Pass
(HasCrCard, EstimatedSalary)	-0.0213	Pass

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()

	CreditScore	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited	Geography_Germany	Geography_Spain	Gender_Male
523	695	4	0.00	2	1	1	137537.22	0	False	False	True
797	828	9	0.00	2	1	1	81853.98	0	False	True	True
109	795	9	130862.43	1	1	1	114935.21	0	True	False	False
3933	546	8	0.00	1	1	1	66408.01	1	False	False	False
6937	621	5	0.00	2	1	0	191756.54	1	False	False	False
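The output above has no `Geography_France` or `Gender_Female` column because `drop_first=True` removes the first category of each encoded feature, which then acts as the implicit baseline. A small pure-Python sketch of that idea follows; the helper `one_hot_drop_first` is hypothetical, and ordering categories alphabetically is an assumption that matches the columns shown above.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode values, dropping the first (alphabetical) category."""
    categories = sorted(set(values))
    kept = categories[1:]  # The dropped first category becomes the baseline
    return [{f"{prefix}_{cat}": value == cat for cat in kept} for value in values]


encoded = one_hot_drop_first(
    ["France", "Spain", "Germany", "France"], prefix="Geography"
)
for row in encoded:
    print(row)
```

A row where every indicator is `False` therefore represents the baseline category, so no information is lost by dropping it.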

from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]

# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)
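The split above does not fix a random seed, so the exact rows in each set can differ between runs. As an illustration of what a seeded 80/20 shuffle split does, here is a minimal standard-library sketch; `shuffle_split` is a hypothetical helper, not the scikit-learn implementation, and rounding the test size up is an assumption.

```python
import math
import random


def shuffle_split(rows, test_size=0.20, seed=None):
    """Shuffle rows and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)  # Copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_rows, test_rows = shuffle_split(rows, test_size=0.20, seed=42)
print(len(train_rows), len(test_rows))  # 8 2
```

Passing the same seed reproduces the same split, which makes test results easier to compare across runs.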

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion submitted by the development team in the format of a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)

/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:525: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.9.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(

Training a potential challenger model

We're curious how an alternate model compares to our champion, so let's train a challenger as a basis for our testing.

Our champion logistic regression model is a simpler, parametric model that assumes a linear relationship between the independent variables and the log-odds of the outcome. While logistic regression may not capture complex patterns as effectively, it offers a high degree of interpretability and is easier to explain to stakeholders. However, risk is not calculated in isolation from a single factor, but rather in consideration with trade-offs in predictive performance, ease of interpretability, and overall alignment with business objectives.
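To illustrate what "linear in the log-odds" means, the sketch below computes a probability from a weighted sum of features and then recovers the log-odds from that probability. The coefficients, intercept, and feature values are made-up numbers for illustration and are not taken from the champion model.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Logistic regression: apply the sigmoid to a linear combination of features."""
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical model with two features
coefficients = [0.8, -1.2]
intercept = 0.1
features = [1.0, 0.5]

probability = predict_probability(features, coefficients, intercept)
recovered_log_odds = math.log(probability / (1.0 - probability))
print(round(probability, 4), round(recovered_log_odds, 4))
```

Each coefficient is the change in log-odds per unit change in its feature, which is what makes the model comparatively easy to explain.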

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models often achieve higher accuracy than simpler models because they can capture complex, non-linear relationships, but combining many trees makes their individual predictions harder to interpret.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)

RandomForestClassifier(n_estimators=50, random_state=42)
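As a simplified picture of how an ensemble of trees arrives at one prediction, the sketch below averages per-tree class probabilities and picks the class with the highest average. The three probability vectors are invented for illustration, and `forest_predict_proba` is a hypothetical helper rather than scikit-learn code.

```python
def forest_predict_proba(tree_probabilities):
    """Average the class probabilities predicted by each tree."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    return [
        sum(tree[c] for tree in tree_probabilities) / n_trees
        for c in range(n_classes)
    ]


# Hypothetical [P(class 0), P(class 1)] outputs from three trees for one customer
tree_probabilities = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.3, 0.7],
]

averaged = forest_predict_proba(tree_probabilities)
predicted_class = max(range(len(averaged)), key=averaged.__getitem__)
print(averaged, predicted_class)
```

In this toy case two of the three trees lean toward class 1, yet the averaged probabilities favor class 0 because the first tree is much more confident. Averaging probabilities and counting hard votes can therefore disagree.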


Initialize the ValidMind models

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data.

Despite the naming convention, ValidMind model objects can be any type of record you want to test, document, validate, or monitor with the ValidMind Library.
From classical statistical and machine learning models, to generative and agentic AI systems and more, the ValidMind model object provides a consistent wrapper around your record so it can be passed as a unified input to any ValidMind test or test suite, with results sent directly to the ValidMind Platform.

Initialize your model objects with vm.init_model():

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

Assign predictions

With our models initialized, we'll move on to assigning both the predicted probabilities coming directly from each model and the binary predictions obtained by applying a cutoff threshold to those probabilities.

The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

# Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)

2026-07-14 05:34:09,605 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:34:09,607 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:34:09,607 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:34:09,609 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-07-14 05:34:09,611 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:34:09,612 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:34:09,613 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:34:09,614 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-07-14 05:34:09,617 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:34:09,641 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:34:09,642 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:34:09,664 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-07-14 05:34:09,667 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:34:09,679 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:34:09,680 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:34:09,692 - INFO(validmind.vm_models.dataset.utils): Done running predict()
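The relationship between predicted probabilities and binary predictions can be sketched in a few lines. The cutoff of 0.5 and the choice to treat a probability exactly at the cutoff as the positive class are assumptions for illustration; `to_binary_predictions` is a hypothetical helper, not a ValidMind function.

```python
def to_binary_predictions(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using a cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical positive-class probabilities for four customers
probabilities = [0.12, 0.50, 0.73, 0.49]

print(to_binary_predictions(probabilities))                 # [0, 1, 1, 0]
print(to_binary_predictions(probabilities, threshold=0.7))  # [0, 0, 1, 0]
```

Raising the cutoff makes the positive label harder to assign, which typically trades recall for precision.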

Running model evaluation tests

With our setup complete, let's run the rest of our validation tests. Since we have already verified the data quality of the dataset used to train our champion, we will now focus on comprehensive performance evaluations of both the champion and challenger models.

Run model performance tests

Let's run some performance tests, beginning with independent testing of our champion logistic regression model, then moving on to our potential challenger model.

Use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")

ID	Name	Description	Has Figure	Has Table	Required Inputs	Params	Tags	Tasks
validmind.model_validation.sklearn.CalibrationCurve	Calibration Curve	Evaluates the calibration of probability estimates by comparing predicted probabilities against observed...	True	False	['model', 'dataset']	{'n_bins': {'type': 'int', 'default': 10}}	['sklearn', 'model_performance', 'classification']	['classification']
validmind.model_validation.sklearn.ClassifierPerformance	Classifier Performance	Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,...	False	True	['dataset', 'model']	{'average': {'type': 'str', 'default': 'macro'}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']	['classification', 'text_classification']
validmind.model_validation.sklearn.ConfusionMatrix	Confusion Matrix	Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...	True	False	['dataset', 'model']	{'threshold': {'type': 'float', 'default': 0.5}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']	['classification', 'text_classification']
validmind.model_validation.sklearn.HyperParametersTuning	Hyper Parameters Tuning	Performs exhaustive grid search over specified parameter ranges to find optimal model configurations...	False	True	['model', 'dataset']	{'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': 'Union', 'default': None}, 'thresholds': {'type': 'Union', 'default': None}, 'fit_params': {'type': 'dict', 'default': None}}	['sklearn', 'model_performance']	['clustering', 'classification']
validmind.model_validation.sklearn.MinimumAccuracy	Minimum Accuracy	Checks if the model's prediction accuracy meets or surpasses a specified threshold....	False	True	['dataset', 'model']	{'min_threshold': {'type': 'float', 'default': 0.7}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']	['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1Score	Minimum F1 Score	Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced...	False	True	['dataset', 'model']	{'min_threshold': {'type': 'float', 'default': 0.5}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']	['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScore	Minimum ROCAUC Score	Validates model by checking if the ROC AUC score meets or surpasses a specified threshold....	False	True	['dataset', 'model']	{'min_threshold': {'type': 'float', 'default': 0.5}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']	['classification', 'text_classification']
validmind.model_validation.sklearn.ModelsPerformanceComparison	Models Performance Comparison	Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,...	False	True	['dataset', 'models']	{}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison']	['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndex	Population Stability Index	Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across...	True	True	['datasets', 'model']	{'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']	['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurve	Precision Recall Curve	Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve....	True	False	['model', 'dataset']	{}	['sklearn', 'binary_classification', 'model_performance', 'visualization']	['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurve	ROC Curve	Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic...	True	False	['model', 'dataset']	{}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']	['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrors	Regression Errors	Assesses the performance and error distribution of a regression model using various error metrics....	False	True	['model', 'dataset']	{}	['sklearn', 'model_performance']	['regression', 'classification']
validmind.model_validation.sklearn.TrainingTestDegradation	Training Test Degradation	Tests if model performance degradation between training and test datasets exceeds a predefined threshold....	False	True	['datasets', 'model']	{'max_threshold': {'type': 'float', 'default': 0.1}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']	['classification', 'text_classification']
validmind.model_validation.statsmodels.GINITable	GINI Table	Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets....	False	True	['dataset', 'model']	{}	['model_performance']	['classification']
validmind.ongoing_monitoring.CalibrationCurveDrift	Calibration Curve Drift	Evaluates changes in probability calibration between reference and monitoring datasets....	True	True	['datasets', 'model']	{'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}}	['sklearn', 'binary_classification', 'model_performance', 'visualization']	['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDrift	Class Discrimination Drift	Compares classification discrimination metrics between reference and monitoring datasets....	False	True	['datasets', 'model']	{'drift_pct_threshold': {'type': '_empty', 'default': 20}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']	['classification', 'text_classification']
validmind.ongoing_monitoring.ClassificationAccuracyDrift	Classification Accuracy Drift	Compares classification accuracy metrics between reference and monitoring datasets....	False	True	['datasets', 'model']	{'drift_pct_threshold': {'type': '_empty', 'default': 20}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']	['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDrift	Confusion Matrix Drift	Compares confusion matrix metrics between reference and monitoring datasets....	False	True	['datasets', 'model']	{'drift_pct_threshold': {'type': '_empty', 'default': 20}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']	['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDrift	ROC Curve Drift	Compares ROC curves between reference and monitoring datasets....	True	False	['datasets', 'model']	{}	['sklearn', 'binary_classification', 'model_performance', 'visualization']	['classification', 'text_classification']

We'll isolate the specific tests we want to run in mpt:

model_validation.sklearn.ClassifierPerformance
model_validation.sklearn.ConfusionMatrix
model_validation.sklearn.MinimumAccuracy
model_validation.sklearn.MinimumF1Score
model_validation.sklearn.ROCCurve

As we learned in the previous notebook 2 — Start the model validation process, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier for our champion model here:

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:logreg_champion",
    "validmind.model_validation.sklearn.ROCCurve:logreg_champion"
]
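The same list can also be built programmatically, which keeps the tag consistent across every test. This is only a sketch: the helper name tag_tests and the variable names are hypothetical and not part of the ValidMind Library. It joins each test ID and the result_id with the : separator described above.

```python
# Hypothetical helper (not part of the ValidMind Library): append a
# result_id to each test ID using the ":" separator.
def tag_tests(test_ids, result_id):
    return [f"{test_id}:{result_id}" for test_id in test_ids]


base_tests = [
    "validmind.model_validation.sklearn.ClassifierPerformance",
    "validmind.model_validation.sklearn.ConfusionMatrix",
    "validmind.model_validation.sklearn.MinimumAccuracy",
    "validmind.model_validation.sklearn.MinimumF1Score",
    "validmind.model_validation.sklearn.ROCCurve",
]

# Produces the same five strings as the hand-written mpt list.
mpt_sketch = tag_tests(base_tests, "logreg_champion")
print(mpt_sketch[0])
```

Calling the helper again with a different result_id yields a second batch of IDs without retyping the test names.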

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

The test set serves as a proxy for real-world data, providing an unbiased estimate of model performance since it was not used during training or tuning.
The test set also guards against selection bias and repeated model tweaking, providing a final, unbiased checkpoint.

for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds,
            "model": vm_log_model,
        },
    ).log()

Classifier Performance Logreg Champion

The Classifier Performance test evaluates classification model performance using precision, recall, F1-score, accuracy, and ROC AUC. The reported results are for a binary classifier with class-level metrics for classes 0 and 1, together with weighted and macro averages. Class 0 shows precision of 0.6183, recall of 0.6125, and F1 of 0.6154, while class 1 shows precision of 0.6242, recall of 0.6300, and F1 of 0.6271. Overall summary metrics report accuracy of 0.6213 and ROC AUC of 0.6707.

Key insights:

Class performance is balanced: Precision, recall, and F1-score are closely aligned across the two classes. Class 0 F1 is 0.6154 and class 1 F1 is 0.6271, with similarly small differences in precision and recall.
Aggregate metrics are consistent: Weighted-average and macro-average metrics are nearly identical, with precision at 0.6213 for both and recall/F1 differing only at the fourth decimal place. This indicates that aggregate performance is not being materially driven by one class over the other.
Overall accuracy is 0.6213: The model’s reported accuracy is 0.6213, matching the weighted-average precision, recall, and F1-score values shown in the summary table.
ROC AUC exceeds accuracy: ROC AUC is reported at 0.6707, which is higher than the observed accuracy of 0.6213 and provides a separate view of ranking performance beyond the classification threshold.

The results show a binary classifier with similar performance across both classes and no pronounced divergence between precision and recall within either class. Summary averages remain closely aligned with the class-level results, indicating a consistent aggregate performance profile. Accuracy is reported at 0.6213, and ROC AUC at 0.6707, reflecting moderate discrimination in the reported test output.

Tables

Precision, Recall, and F1

Class	Precision	Recall	F1
0	0.6183	0.6125	0.6154
1	0.6242	0.6300	0.6271
Weighted Average	0.6213	0.6213	0.6213
Macro Average	0.6213	0.6212	0.6212

Accuracy and ROC AUC

Metric	Value
Accuracy	0.6213
ROC AUC	0.6707

2026-07-14 05:34:18,305 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:logreg_champion does not exist in model's document

Confusion Matrix Logreg Champion

The Confusion Matrix test evaluates the classification model’s predictive performance by comparing predicted labels with observed labels and displaying the resulting counts of true positives, true negatives, false positives, and false negatives. In this result, the matrix shows 206 true positives, 196 true negatives, 124 false positives, and 121 false negatives. The plot presents these outcomes across the two class labels, allowing direct comparison of correct and incorrect classifications for both positive and negative cases.

Key insights:

Correct classifications exceed errors: The model records 206 true positives and 196 true negatives, compared with 124 false positives and 121 false negatives. Correct predictions total 402, while misclassifications total 245.
Error types are closely balanced: False positives and false negatives are similar in magnitude, at 124 and 121 respectively. This indicates that misclassification is distributed relatively evenly across the two error types.
Positive-class detection is stronger than negative-class detection: True positives (206) exceed true negatives (196), while false negatives (121) are slightly lower than false positives (124). This reflects marginally stronger identification of class 1 than class 0.
Observed class counts are similar: For true class 1, the matrix shows 327 observations in total (206 true positives and 121 false negatives). For true class 0, the matrix shows 320 observations in total (196 true negatives and 124 false positives), indicating a near-balanced evaluated sample.

The confusion matrix indicates that the model produces more correct classifications than errors, with similar performance across the two classes. Misclassification is split almost evenly between false positives and false negatives, with only a small difference between them. The evaluated sample is also nearly balanced by true class, which supports direct comparison of the model’s behavior across positive and negative outcomes.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion:3968

2026-07-14 05:34:26,636 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:logreg_champion does not exist in model's document
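The four counts in the confusion matrix above are enough to reproduce the headline figures from the Classifier Performance test. The following sketch uses only the standard metric definitions and the counts reported in this notebook; the variable names are illustrative, and this is not how the ValidMind Library computes its results internally.

```python
# Counts reported by the Confusion Matrix test for the champion model.
tp, tn, fp, fn = 206, 196, 124, 121

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 (positive class).
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0: the same formulas with class 0 treated as the class of interest.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy={accuracy:.4f}")
print(f"class 0: precision={precision_0:.4f} recall={recall_0:.4f} f1={f1_0:.4f}")
print(f"class 1: precision={precision_1:.4f} recall={recall_1:.4f} f1={f1_1:.4f}")
```

The rounded values match the Precision, Recall, and F1 table logged for logreg_champion, which is a quick way to confirm that the two tests were run on the same predictions.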

❌ Minimum Accuracy Logreg Champion

The Minimum Accuracy test evaluates whether the model’s prediction accuracy meets or exceeds a specified minimum threshold. The result table reports the observed accuracy score alongside the configured threshold and the overall pass/fail outcome. For logreg_champion, the recorded accuracy is 0.6213 against a threshold of 0.7, and the test outcome is marked as Fail.

Key insights:

Accuracy below threshold: The observed accuracy score is 0.6213, which is below the minimum threshold of 0.7 used in this test.
Test outcome is fail: The model did not meet the criterion defined by the Minimum Accuracy test, and the result is recorded as Fail.
Gap to threshold is 0.0787: The difference between the observed score and the threshold is 0.0787, quantifying the shortfall relative to the test requirement.

The test result shows that the model’s observed classification accuracy did not reach the minimum level defined for this evaluation. The measured score of 0.6213 falls 0.0787 below the 0.7 threshold, resulting in a failed test outcome.

Tables

Score	Threshold	Pass/Fail
0.6213	0.7	Fail

2026-07-14 05:34:31,743 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:logreg_champion does not exist in model's document

✅ Minimum F1 Score Logreg Champion

The Minimum F1 Score test evaluates whether the model’s F1 score on the validation dataset meets a predefined minimum threshold. The result table reports the observed F1 score, the configured threshold, and the resulting pass/fail status for logreg_champion. In this run, the validation F1 score is 0.6271 against a threshold of 0.5, and the test outcome is recorded as Pass.

Key insights:

F1 score exceeds threshold: The observed validation F1 score is 0.6271, which is 0.1271 higher than the minimum threshold of 0.5.
Test outcome is pass: The reported pass/fail status is Pass, indicating the model met the predefined acceptance criterion for this test.
Margin above minimum is positive: The difference between the observed score and threshold is positive, showing that performance cleared the minimum standard rather than matching it exactly.

The result shows that the model achieved an F1 score above the configured minimum on the validation set, with a recorded pass outcome. The observed margin of 0.1271 above the threshold indicates that the test condition was satisfied with measurable separation from the cutoff.

Tables

Score	Threshold	Pass/Fail
0.6271	0.5	Pass

2026-07-14 05:34:35,490 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:logreg_champion does not exist in model's document

ROC Curve Logreg Champion

The ROCCurve test evaluates binary classification performance by plotting the Receiver Operating Characteristic curve and calculating the Area Under the Curve (AUC). The result shows the ROC curve for log_model_champion on test_dataset_final, with the model curve plotted against the random-classification reference line. The reported AUC is 0.67, and the curve remains above the diagonal benchmark across most of the false positive rate range.

Key insights:

AUC indicates moderate discrimination: The plotted ROC curve reports an AUC of 0.67, summarizing the model’s ability to distinguish between the two classes on the test dataset.
Performance exceeds random baseline: The ROC curve stays above the dashed random-reference line associated with AUC = 0.5, indicating better-than-random ranking performance across thresholds.
Discrimination varies across thresholds: The curve increases gradually rather than sharply, with true positive rate gains occurring alongside rising false positive rates throughout the threshold range.

The ROC result shows that log_model_champion achieves classification performance above the random benchmark on test_dataset_final, as reflected by an AUC of 0.67 and a curve positioned above the diagonal baseline. The shape of the curve indicates measurable discriminatory ability, while the gradual ascent across thresholds reflects moderate separation between the positive and negative classes rather than strong concentration of true positives at low false positive rates.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:logreg_champion:cf40

2026-07-14 05:34:43,311 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:logreg_champion does not exist in model's document
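To make the AUC figure more concrete, here is a minimal sketch of one standard way to compute it: the share of positive/negative pairs in which the positive example receives the higher score, with ties counted as half. The function name pairwise_auc and the four toy scores are hypothetical and unrelated to the notebook's dataset, and this is not the implementation used by the ROCCurve test.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two negatives and two positives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 pairs are ranked correctly
```

Under this reading, an AUC of 0.5 corresponds to ranking pairs no better than chance, which is the diagonal reference line in the plot.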

Note the output returned indicating that a test-driven block doesn't currently exist in your documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log an artifact

As the output above shows, our champion doesn't pass the MinimumAccuracy test at the default threshold of the out-of-the-box test, so let's log an artifact (finding) in the ValidMind Platform (Learn more: Add and manage artifacts):

From the Inventory in the ValidMind Platform, go to the model you connected to earlier.
In the left sidebar that appears for your model, click Validation under Documents.
Click on 2.2.2. Model Performance to expand that section.
Under the Model Performance Metrics guideline, click to expand the Artifacts panel.
Click Link Artifact and select Validation Issue as the type of artifact.
Click + Add Validation Issue and enter in the details for your validation issue, for example:
- Title — Champion Logistic Regression Model Fails Minimum Accuracy Threshold
- Risk Area — Model Performance
- Documentation Section — 3.2. Model Evaluation
- Description — The logistic regression champion model was subjected to a Minimum Accuracy test to determine whether its predictive accuracy meets the predefined performance threshold of 0.7. The model achieved an accuracy score of 0.6213, which falls below the required minimum. As a result, the test produced a Fail outcome.
Click Add Validation Issue to submit the validation issue.
Select the validation issue you just added to link to your validation report.
Click Update Linked Artifacts to insert your validation issue.
Confirm that the validation issue you added now appears in section 2.2.2. Model Performance of the report.
Click the validation issue to expand it, where you can adjust details such as severity, owner, due date, and status, and include proposed remediation plans or supporting documentation as attachments.

Evaluate performance of challenger model

We've now run tests on our champion similar to those run by the development team, with the aim of verifying their results.

Next, let's see how our challenger compares. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger",
    "validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger",
    "validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger",
    "validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger"
]

We'll run each test once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()
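One way to picture the input_grid argument is as a set of lists that expands into one run per combination of values. The sketch below assumes a plain Cartesian product, which is consistent with the one-run-per-model behavior described above but is not taken from the ValidMind Library's source. The strings are stand-ins for the real vm_test_ds, vm_log_model, and vm_rf_model objects.

```python
from itertools import product

# Stand-in strings replace the real ValidMind dataset and model objects.
input_grid = {
    "dataset": ["vm_test_ds"],
    "model": ["vm_log_model", "vm_rf_model"],
}

# Assumption: every combination of the listed values becomes one run.
keys = list(input_grid)
combinations = [
    dict(zip(keys, values)) for values in product(*input_grid.values())
]

for combo in combinations:
    print(combo)
```

With one dataset and two models, the expansion yields two input sets, so each test in mpt_chall reports one row or figure per model.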

Classifier Performance Champion Vs Challenger

The Classifier Performance test evaluates classification model performance using class-level precision, recall, F1-score, accuracy, and ROC AUC. The results compare the champion model (log_model_champion) and the challenger (rf_model) across both classes, along with macro and weighted averages. For log_model_champion, weighted-average precision, recall, and F1 are each approximately 0.621, with accuracy of 0.6213 and ROC AUC of 0.6707. For rf_model, weighted-average precision, recall, and F1 are each approximately 0.714, with accuracy of 0.7141 and ROC AUC of 0.7762.

Key insights:

Challenger outperforms champion across metrics: rf_model exceeds log_model_champion on all reported aggregate measures. Accuracy increases from 0.6213 to 0.7141, weighted-average F1 from 0.6213 to 0.7139, and ROC AUC from 0.6707 to 0.7762.
Stronger class-level performance in rf_model: For class 0, rf_model records precision 0.7003, recall 0.7375, and F1 0.7184 versus 0.6183, 0.6125, and 0.6154 for log_model_champion. For class 1, rf_model records precision 0.7290, recall 0.6911, and F1 0.7096 versus 0.6242, 0.6300, and 0.6271.
Champion shows more balanced class recall: log_model_champion has similar recall across class 0 (0.6125) and class 1 (0.6300). In contrast, rf_model shows a wider recall spread, with 0.7375 for class 0 and 0.6911 for class 1.
Aggregate averages are internally consistent: For both models, macro-average and weighted-average metrics are nearly identical. log_model_champion has macro-average F1 of 0.6212 and weighted-average F1 of 0.6213, while rf_model has macro-average F1 of 0.7140 and weighted-average F1 of 0.7139.

Overall, the comparison shows that rf_model delivers higher discrimination and classification performance than log_model_champion on every reported aggregate metric and on both class-level F1 scores. The champion model exhibits more even recall between the two classes, while the challenger’s results are stronger overall but somewhat less balanced across class recalls. The close alignment between macro and weighted averages for both models indicates that the aggregate summaries are consistent with the class-level results reported in the test.

Tables

model	Class	Precision	Recall	F1
log_model_champion	0	0.6183	0.6125	0.6154
log_model_champion	1	0.6242	0.6300	0.6271
log_model_champion	Weighted Average	0.6213	0.6213	0.6213
log_model_champion	Macro Average	0.6213	0.6212	0.6212
rf_model	0	0.7003	0.7375	0.7184
rf_model	1	0.7290	0.6911	0.7096
rf_model	Weighted Average	0.7148	0.7141	0.7139
rf_model	Macro Average	0.7147	0.7143	0.7140

model	Metric	Value
log_model_champion	Accuracy	0.6213
log_model_champion	ROC AUC	0.6707
rf_model	Accuracy	0.7141
rf_model	ROC AUC	0.7762

2026-07-14 05:34:50,852 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ClassifierPerformance:champion_vs_challenger does not exist in model's document

Confusion Matrix Champion Vs Challenger

The Confusion Matrix test evaluates classification performance by comparing predicted class labels with observed outcomes and summarizing results as true positives, true negatives, false positives, and false negatives. The results are presented for two models: log_model_champion and rf_model. For log_model_champion, the matrix contains 206 true positives, 196 true negatives, 124 false positives, and 121 false negatives. For rf_model, the matrix contains 226 true positives, 236 true negatives, 84 false positives, and 101 false negatives.

Key insights:

rf_model shows more correct classifications: rf_model records 226 true positives and 236 true negatives, compared with 206 and 196 respectively for log_model_champion. This indicates a higher count of correct predictions in both the positive and negative classes.
rf_model has fewer false positives: False positives decrease from 124 in log_model_champion to 84 in rf_model. The reduction is concentrated in cases where the true class is 0 but the model predicts 1.
rf_model has fewer false negatives: False negatives decrease from 121 in log_model_champion to 101 in rf_model. This reflects fewer missed positive cases for rf_model.
Error balance improves across both classes: Relative to log_model_champion, rf_model reduces both major error types while increasing both correct outcome counts. The observed improvement is not limited to a single quadrant of the confusion matrix.

Across the observed confusion matrix counts, rf_model demonstrates stronger classification performance than log_model_champion. The comparison shows higher true positive and true negative counts together with lower false positive and false negative counts. The results indicate a consistent improvement in classification outcomes across both classes rather than a trade-off between one error type and another.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:4ccf

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger:f0b6

2026-07-14 05:35:02,459 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ConfusionMatrix:champion_vs_challenger does not exist in model's document

❌ Minimum Accuracy Champion Vs Challenger

The Minimum Accuracy test evaluates whether each model’s prediction accuracy meets or exceeds the defined minimum threshold. The results table reports accuracy scores, the common threshold value of 0.7, and the corresponding pass/fail outcome for each model under comparison. Two models are shown: log_model_champion with an accuracy score of 0.6213 and rf_model with an accuracy score of 0.7141.

Key insights:

Only one model passes threshold: rf_model achieves a score of 0.7141 against the 0.7 threshold and is marked as Pass, while log_model_champion records 0.6213 and is marked as Fail.
Accuracy gap is material: The difference in accuracy between rf_model and log_model_champion is 0.0928, indicating higher observed predictive correctness for rf_model on this test.
Champion model falls below minimum: log_model_champion underperforms the defined threshold by 0.0787, based on its reported score of 0.6213 versus the 0.7 requirement.
Passing margin is narrow for challenger: rf_model exceeds the threshold by 0.0141, indicating that its passing result is above the minimum but with limited margin.

The test results show differing accuracy outcomes across the two evaluated models under a shared threshold. rf_model is the only model that meets the minimum accuracy criterion, while log_model_champion remains below the required level. The observed spread between model scores indicates stronger test-set accuracy for rf_model, although its pass result is only modestly above the threshold.

Tables

model	Score	Threshold	Pass/Fail
log_model_champion	0.6213	0.7	Fail
rf_model	0.7141	0.7	Pass

2026-07-14 05:35:10,439 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumAccuracy:champion_vs_challenger does not exist in model's document
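The pass/fail logic in the table above reduces to comparing each score with the threshold. A minimal sketch using the reported scores and the default min_threshold of 0.7 follows; the dictionary layout and variable names are illustrative, not the ValidMind Library's internals.

```python
min_threshold = 0.7  # default threshold listed for MinimumAccuracy

# Accuracy scores reported in the results table above.
scores = {"log_model_champion": 0.6213, "rf_model": 0.7141}

results = {}
for name, score in scores.items():
    passed = score >= min_threshold             # "meets or surpasses"
    margin = round(score - min_threshold, 4)    # negative means a shortfall
    results[name] = (passed, margin)
    print(f"{name}: {'Pass' if passed else 'Fail'} (margin {margin:+.4f})")
```

The margins reproduce the figures in the key insights: a shortfall of 0.0787 for the champion and a buffer of 0.0141 for the challenger.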

✅ Minimum F1 Score Champion Vs Challenger

The MinimumF1Score test evaluates whether each model’s validation-set F1 score meets a predefined minimum threshold. The results table reports the validation F1 score, the threshold, and the pass/fail outcome for the champion and challenger models. In this test run, both models were evaluated against the same threshold of 0.5, with reported F1 scores of 0.6271 for log_model_champion and 0.7096 for rf_model.

Key insights:

Both models passed the threshold: log_model_champion and rf_model both received a Pass result against the minimum F1 threshold of 0.5, indicating that each model’s validation F1 score exceeded the defined cutoff.
Challenger achieved the higher F1 score: rf_model recorded an F1 score of 0.7096 versus 0.6271 for log_model_champion, a difference of 0.0825 on the validation set.
Margin above threshold differs by model: log_model_champion exceeded the threshold by 0.1271, while rf_model exceeded it by 0.2096, showing a larger buffer relative to the minimum requirement for rf_model.

The test results show that both evaluated models satisfied the minimum validation F1 criterion. Within this comparison, rf_model produced the stronger observed F1 performance and the larger margin above the common threshold, while log_model_champion also remained above the required minimum.

Tables

model	Score	Threshold	Pass/Fail
log_model_champion	0.6271	0.5	Pass
rf_model	0.7096	0.5	Pass

2026-07-14 05:35:14,390 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumF1Score:champion_vs_challenger does not exist in model's document

ROC Curve Champion Vs Challenger

The ROCCurve:champion_vs_challenger test evaluates binary classification performance by plotting the ROC curve and calculating the AUC for each model on the test dataset. The results show ROC curves for log_model_champion and rf_model, each compared against the random-classification reference line. The displayed AUC values are 0.67 for log_model_champion and 0.78 for rf_model, and both curves remain above the diagonal baseline across most of the false positive rate range.

Key insights:

Challenger shows higher AUC: rf_model records an AUC of 0.78 versus 0.67 for log_model_champion, indicating stronger class separation on the test dataset in this comparison.
Both models outperform random: Both ROC curves lie above the random baseline and both AUC values exceed 0.5, showing discriminative ability for each model.
Performance gap is visible across thresholds: The rf_model ROC curve stays consistently above the log_model_champion curve visually, reflecting stronger true positive rates at comparable false positive rates across much of the threshold range.
Champion exhibits more moderate discrimination: The log_model_champion curve rises more gradually and remains closer to the diagonal reference line than rf_model, consistent with its lower AUC value.

The ROC comparison indicates that both models provide positive discrimination between the two classes on the test dataset, with rf_model demonstrating stronger performance than log_model_champion. The difference is reflected both numerically, through the AUC values of 0.78 and 0.67, and visually, through the higher ROC curve for rf_model across most thresholds. Overall, the result shows a clear separation in ranking performance between the two evaluated models.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:f34b

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger:ad86

2026-07-14 05:35:26,496 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve:champion_vs_challenger does not exist in model's document

Based on the performance metrics, our challenger random forest classification model passes the MinimumAccuracy test where our champion did not.

In your validation report, support the recommendation in your validation issue's Proposed Remediation Plan to investigate using our challenger by inserting the performance tests we logged with this notebook into the appropriate section.

Run diagnostic tests

Next, let's compare the robustness and stability of our champion and challenger.

Use list_tests() to list all available diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")

ID	Name	Description	Has Figure	Has Table	Required Inputs	Params	Tags	Tasks
validmind.model_validation.sklearn.OverfitDiagnosis	Overfit Diagnosis	Assesses potential overfitting in a model's predictions, identifying regions where performance between training and...	True	True	['model', 'datasets']	{'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}}	['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis']	['classification', 'regression']
validmind.model_validation.sklearn.RobustnessDiagnosis	Robustness Diagnosis	Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions....	True	True	['datasets', 'model']	{'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': 'List', 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}}	['sklearn', 'model_diagnosis', 'visualization']	['classification', 'regression']
validmind.model_validation.sklearn.WeakspotsDiagnosis	Weakspots Diagnosis	Identifies and visualizes weak spots in a machine learning model's performance across various sections of the...	True	True	['datasets', 'model']	{'features_columns': {'type': 'Optional', 'default': None}, 'metrics': {'type': 'Optional', 'default': None}, 'thresholds': {'type': 'Optional', 'default': None}}	['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization']	['classification', 'text_classification']

Let’s now assess the models for potential signs of overfitting and identify any sub-segments where performance may be inconsistent, using the model_validation.sklearn.OverfitDiagnosis test.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations. The result is excellent performance on the training dataset but poor generalization to new, unseen data:

Since the training dataset (vm_train_ds) was used to fit the model, we use this set to establish a baseline performance for how well the model performs on data it has already seen.
The testing dataset (vm_test_ds) was never seen during training, and here simulates real-world generalization, or how well the model performs on new, unseen data.
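The train-versus-test comparison described above can be sketched as a per-slice gap check against the test's default cut_off_threshold of 0.04. The first two slices use AUC values reported for log_model_champion in this notebook's Overfit Diagnosis output; the third slice is hypothetical and included only to show a segment that is not flagged. Whether the library uses a strict or inclusive comparison is an assumption here.

```python
cut_off_threshold = 0.04  # default listed for OverfitDiagnosis

# (slice label, training AUC, test AUC)
slices = [
    ("Balance (25089.809, 50179.618]", 0.5128, 0.0),
    ("Tenure (2.0, 3.0]", 0.6913, 0.4939),
    ("hypothetical slice", 0.6500, 0.6400),  # illustrative values only
]

flagged = []
for label, train_auc, test_auc in slices:
    gap = round(train_auc - test_auc, 4)
    # Assumption: a slice is flagged when the gap exceeds the threshold.
    if gap > cut_off_threshold:
        flagged.append((label, gap))

for label, gap in flagged:
    print(f"{label}: gap {gap}")
```

A large gap in a slice with very few records, such as the Balance slice here, is weaker evidence of overfitting than the same gap in a well-populated slice, so record counts are worth reading alongside the gap.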

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

Overfit Diagnosis Champion Vs Challenger

The Overfit Diagnosis test evaluates differences between training and test AUC across feature slices to identify regions where the performance gap exceeds the 0.04 threshold. Results are reported for both log_model_champion and rf_model, with slice-level training and test record counts, AUC values, and observed gaps. The output highlights which feature segments exceed the threshold and shows the magnitude and direction of the train-test gap within each segment. Across the reported features, the two models exhibit markedly different gap patterns and magnitudes.

Key insights:

Random forest shows pervasive gaps: For rf_model, every reported overfit region has a positive gap above the 0.04 threshold, and training AUC is 1.0 in all listed slices. Reported gaps span multiple features including CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Geography_Germany, Geography_Spain, and Gender_Male.
Champion model gaps are more localized: For log_model_champion, flagged gaps are concentrated in selected slices rather than consistently across all reported features. Several variables show only a small number of slices above threshold, including CreditScore, Tenure, Balance, NumOfProducts, EstimatedSalary, and Geography_Germany.
Largest champion-model gap occurs in Balance: The largest reported gap for log_model_champion is in Balance slice (25089.809, 50179.618], with training AUC 0.5128, test AUC 0.0, and gap 0.5128. This slice also has very small sample counts, with 25 training records and 2 test records.
Largest random-forest gap occurs in Balance: The largest reported gap for rf_model is in Balance slice (200718.472, 225808.281], with training AUC 1.0, test AUC 0.0, and gap 1.0. This slice is based on 16 training records and 1 test record.
Tenure is a prominent source of gaps: Tenure contains multiple flagged slices for both models. For log_model_champion, the largest Tenure gap is 0.1974 in slice (2.0, 3.0] with training AUC 0.6913 and test AUC 0.4939; for rf_model, all reported Tenure slices exceed threshold, with gaps ranging from 0.0998 to 0.2941.
NumOfProducts contains large gaps in both models: In log_model_champion, NumOfProducts slice (2.8, 3.1] has training AUC 0.7589, test AUC 0.5, and gap 0.2589. In rf_model, all reported NumOfProducts slices exceed threshold, including a maximum gap of 0.5 for the same (2.8, 3.1] slice.
EstimatedSalary gaps are broader for rf_model: log_model_champion shows selected EstimatedSalary slices above threshold, with the largest gap 0.119 in (79943.476, 99926.45]. In contrast, rf_model shows all listed EstimatedSalary slices above threshold, with gaps ranging from 0.1444 to 0.2965.
Binary geography and membership features differ by model: For log_model_champion, Geography_Germany is the only reported binary geography slice above threshold, with gap 0.1067 for (0.9, 1.0], while HasCrCard, IsActiveMember, Geography_Spain, and Gender_Male do not exceed threshold in the reported table. For rf_model, all listed slices for HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, and Gender_Male exceed threshold.

The results indicate a clear contrast between the two models at the slice level. log_model_champion exhibits threshold breaches in a limited set of feature regions, with the largest gaps concentrated in specific Balance, Tenure, NumOfProducts, and EstimatedSalary slices. rf_model displays substantially broader and larger train-test AUC gaps across nearly all reported features, with training AUC fixed at 1.0 in every listed overfit region and the most extreme gap appearing in a very small Balance slice.

Tables

model	Feature	Slice	Number of Training Records	Number of Test Records	Training AUC	Test AUC	Gap
log_model_champion	CreditScore	(450.0, 500.0]	122	29	0.6675	0.6111	0.0564
log_model_champion	CreditScore	(600.0, 650.0]	490	116	0.6892	0.6151	0.0741
log_model_champion	CreditScore	(700.0, 750.0]	391	106	0.7197	0.6638	0.0560
log_model_champion	Tenure	(1.0, 2.0]	270	71	0.6721	0.6040	0.0681
log_model_champion	Tenure	(2.0, 3.0]	277	73	0.6913	0.4939	0.1974
log_model_champion	Tenure	(6.0, 7.0]	224	71	0.6975	0.6184	0.0791
log_model_champion	Tenure	(8.0, 9.0]	262	64	0.7003	0.6265	0.0739
log_model_champion	Tenure	(9.0, 10.0]	132	31	0.6998	0.6218	0.0779
log_model_champion	Balance	(25089.809, 50179.618]	25	2	0.5128	0.0000	0.5128
log_model_champion	Balance	(150538.854, 175628.663]	189	48	0.6317	0.5861	0.0456
log_model_champion	Balance	(175628.663, 200718.472]	51	15	0.5603	0.2955	0.2649
log_model_champion	NumOfProducts	(1.9, 2.2]	928	228	0.7023	0.6499	0.0523
log_model_champion	NumOfProducts	(2.8, 3.1]	156	35	0.7589	0.5000	0.2589
log_model_champion	EstimatedSalary	(-188.25, 19994.554]	245	71	0.7310	0.6348	0.0962
log_model_champion	EstimatedSalary	(19994.554, 39977.528]	245	66	0.6489	0.6019	0.0470
log_model_champion	EstimatedSalary	(59960.502, 79943.476]	266	69	0.7044	0.6517	0.0527
log_model_champion	EstimatedSalary	(79943.476, 99926.45]	256	54	0.6822	0.5632	0.1190
log_model_champion	EstimatedSalary	(159875.372, 179858.346]	301	61	0.6853	0.6304	0.0549
log_model_champion	Geography_Germany	(0.9, 1.0]	794	204	0.6464	0.5397	0.1067
rf_model	CreditScore	(450.0, 500.0]	122	29	1.0000	0.6917	0.3083
rf_model	CreditScore	(500.0, 550.0]	261	69	1.0000	0.8752	0.1248
rf_model	CreditScore	(550.0, 600.0]	371	89	1.0000	0.7336	0.2664
rf_model	CreditScore	(600.0, 650.0]	490	116	1.0000	0.7770	0.2230
rf_model	CreditScore	(650.0, 700.0]	490	125	1.0000	0.7215	0.2785
rf_model	CreditScore	(700.0, 750.0]	391	106	1.0000	0.7998	0.2002
rf_model	CreditScore	(750.0, 800.0]	233	61	1.0000	0.8203	0.1797
rf_model	CreditScore	(800.0, 850.0]	156	40	1.0000	0.7995	0.2005
rf_model	Tenure	(-0.01, 1.0]	352	89	1.0000	0.7885	0.2115
rf_model	Tenure	(1.0, 2.0]	270	71	1.0000	0.8407	0.1593
rf_model	Tenure	(2.0, 3.0]	277	73	1.0000	0.7205	0.2795
rf_model	Tenure	(3.0, 4.0]	268	71	1.0000	0.7584	0.2416
rf_model	Tenure	(4.0, 5.0]	265	58	1.0000	0.9002	0.0998
rf_model	Tenure	(5.0, 6.0]	256	54	1.0000	0.8462	0.1538
rf_model	Tenure	(6.0, 7.0]	224	71	1.0000	0.7532	0.2468
rf_model	Tenure	(7.0, 8.0]	279	65	1.0000	0.7289	0.2711
rf_model	Tenure	(8.0, 9.0]	262	64	1.0000	0.7696	0.2304
rf_model	Tenure	(9.0, 10.0]	132	31	1.0000	0.7059	0.2941
rf_model	Balance	(-250.898, 25089.809]	852	200	1.0000	0.8546	0.1454
rf_model	Balance	(50179.618, 75269.427]	96	21	1.0000	0.7404	0.2596
rf_model	Balance	(75269.427, 100359.236]	290	69	1.0000	0.6370	0.3630
rf_model	Balance	(100359.236, 125449.045]	579	159	1.0000	0.7876	0.2124
rf_model	Balance	(125449.045, 150538.854]	485	132	1.0000	0.6951	0.3049
rf_model	Balance	(150538.854, 175628.663]	189	48	1.0000	0.7357	0.2643
rf_model	Balance	(175628.663, 200718.472]	51	15	1.0000	0.7841	0.2159
rf_model	Balance	(200718.472, 225808.281]	16	1	1.0000	0.0000	1.0000
rf_model	NumOfProducts	(0.997, 1.3]	1462	379	1.0000	0.6768	0.3232
rf_model	NumOfProducts	(1.9, 2.2]	928	228	1.0000	0.6869	0.3131
rf_model	NumOfProducts	(2.8, 3.1]	156	35	1.0000	0.5000	0.5000
rf_model	HasCrCard	(-0.001, 0.1]	768	191	1.0000	0.7384	0.2616
rf_model	HasCrCard	(0.9, 1.0]	1817	456	1.0000	0.7903	0.2097
rf_model	IsActiveMember	(-0.001, 0.1]	1371	362	1.0000	0.7710	0.2290
rf_model	IsActiveMember	(0.9, 1.0]	1214	285	1.0000	0.7699	0.2301
rf_model	EstimatedSalary	(-188.25, 19994.554]	245	71	1.0000	0.7532	0.2468
rf_model	EstimatedSalary	(19994.554, 39977.528]	245	66	1.0000	0.7681	0.2319
rf_model	EstimatedSalary	(39977.528, 59960.502]	250	61	1.0000	0.7035	0.2965
rf_model	EstimatedSalary	(59960.502, 79943.476]	266	69	1.0000	0.7535	0.2465
rf_model	EstimatedSalary	(79943.476, 99926.45]	256	54	1.0000	0.7040	0.2960
rf_model	EstimatedSalary	(99926.45, 119909.424]	259	61	1.0000	0.8225	0.1775
rf_model	EstimatedSalary	(119909.424, 139892.398]	258	69	1.0000	0.7814	0.2186
rf_model	EstimatedSalary	(139892.398, 159875.372]	269	57	1.0000	0.8183	0.1817
rf_model	EstimatedSalary	(159875.372, 179858.346]	301	61	1.0000	0.8297	0.1703
rf_model	EstimatedSalary	(179858.346, 199841.32]	236	78	1.0000	0.8556	0.1444
rf_model	Geography_Germany	(-0.001, 0.1]	1791	443	1.0000	0.7780	0.2220
rf_model	Geography_Germany	(0.9, 1.0]	794	204	1.0000	0.7064	0.2936
rf_model	Geography_Spain	(-0.001, 0.1]	1978	507	1.0000	0.7459	0.2541
rf_model	Geography_Spain	(0.9, 1.0]	607	140	1.0000	0.8653	0.1347
rf_model	Gender_Male	(-0.001, 0.1]	1242	350	1.0000	0.7353	0.2647
rf_model	Gender_Male	(0.9, 1.0]	1343	297	1.0000	0.8127	0.1873

Figures

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:94d2

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:aac8

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:075b

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:6791

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:7228

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:2f1b

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:31ec

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:1b6c

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:27a6

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:8972

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:ee04

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:e3d0

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:40bc

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:67cd

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:da8b

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:0110

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:644d

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:89c3

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:c395

ValidMind Figure validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger:49fd

2026-07-14 05:35:46,048 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.OverfitDiagnosis:champion_vs_challenger does not exist in model's document

Let's also conduct robustness and stability testing of the two models with the model_validation.sklearn.RobustnessDiagnosis test.

Robustness refers to a model's ability to maintain consistent performance when its inputs are noisy or perturbed, and stability refers to a model's ability to produce consistent outputs over time and across different data subsets.
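
One way to picture this kind of test is to add Gaussian noise to a feature, rescore the model, and track how far AUC falls from its unperturbed baseline. The sketch below assumes the noise standard deviation is the scaling factor times the feature's own standard deviation, and that a run passes when decay stays below the threshold; the data, the `model` stand-in, and the `perturb` helper are hypothetical. The scaling factors and the 0.05 threshold mirror the defaults listed for RobustnessDiagnosis in the test table above.

```python
import random
import statistics


def auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)

# Hypothetical dataset: a single numeric feature that partly determines the label
xs = [rng.gauss(0.0, 1.0) for _ in range(300)]
ys = [1 if x + rng.gauss(0.0, 1.0) > 0 else 0 for x in xs]


def model(x):
    """Stand-in scoring function; a real model's predicted probability would go here."""
    return x


def perturb(values, scale, rng):
    """Add zero-mean Gaussian noise with std dev = scale * std dev of the feature (assumed)."""
    if scale == 0:
        return list(values)
    sd = statistics.pstdev(values)
    return [v + rng.gauss(0.0, scale * sd) for v in values]


threshold = 0.05  # performance_decay_threshold default from the test table
baseline = auc(ys, [model(x) for x in xs])

results = []
for scale in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    noisy = perturb(xs, scale, rng)
    score = auc(ys, [model(x) for x in noisy])
    decay = baseline - score
    results.append({"scale": scale, "auc": score, "decay": decay,
                    "passed": decay < threshold})

for row in results:
    print(row)
```

Because the noise is random, decay does not have to grow strictly with the scaling factor, and it can even be slightly negative at a given level.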

Again, we'll use both the training and testing datasets to establish baseline performance and to simulate real-world generalization:

# Measure AUC decay under increasing noise for both the champion and the challenger
vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_log_model, vm_rf_model],
    },
).log()

❌ Robustness Diagnosis Champion Vs Log Regression

The Robustness Diagnosis:Champion_vs_LogRegression test evaluates model robustness by measuring AUC decay after applying Gaussian noise to numeric input features at increasing perturbation sizes. Results are reported for both log_model_champion and rf_model on the training and test datasets across perturbation levels from baseline through 0.5 standard deviations. The table and plots show baseline AUC, perturbed AUC, performance decay, and pass/fail outcomes for each dataset-model combination, allowing comparison of how each model’s performance changes as noise intensity increases.

Key insights:

Logistic model shows limited decay: log_model_champion baseline AUC is 0.6827 on train and 0.6707 on test, and remains within 0.6653 to 0.6826 on train and 0.6476 to 0.6737 on test across all perturbation levels. Reported performance decay stays between -0.0030 and 0.0231, and all observations pass.
Random forest train performance declines sharply: rf_model training AUC declines from 1.0000 at baseline to 0.9837, 0.9433, 0.8945, 0.8493, and 0.7883 as perturbation increases from 0.1 to 0.5. Corresponding train performance decay rises from 0.0163 to 0.2117, with failures beginning at perturbation size 0.2 and continuing through 0.5.
Random forest test performance weakens at higher noise: On the test dataset, rf_model starts at AUC 0.7762 and moves to 0.7628, 0.7753, 0.7433, 0.7243, and 0.6723 across increasing perturbation sizes. Test performance decay remains small through 0.3 at 0.0134, 0.0009, and 0.0329, then increases to 0.0519 and 0.1039 at 0.4 and 0.5, where the test results fail.
Logistic model is more stable across datasets: At the highest perturbation size of 0.5, log_model_champion records AUC of 0.6700 on train and 0.6476 on test, compared with rf_model at 0.7883 on train and 0.6723 on test. However, the logistic model’s decay is much smaller than the random forest’s at the same perturbation level, particularly on train where decay is 0.0127 versus 0.2117.
Test-set response is not strictly monotonic: Both models show some non-monotonic changes on the test dataset under intermediate noise levels. log_model_champion test AUC increases from 0.6630 at 0.2 to 0.6737 at 0.3, producing a negative decay of -0.0030, while rf_model test AUC at 0.2 (0.7753) is slightly above its 0.1 value (0.7628) and close to baseline.

The robustness results differentiate the two models clearly under Gaussian perturbation. log_model_champion maintains relatively stable AUC on both training and test datasets across all evaluated noise levels and passes all recorded conditions. rf_model begins with higher baseline AUC, but its training performance deteriorates materially as perturbation increases, with failures from 0.2 onward, and its test performance also fails at the two highest perturbation levels. Collectively, the results show lower sensitivity to the applied noise for log_model_champion and stronger performance decay for rf_model, especially on the training dataset.

Tables

model	Perturbation Size	Dataset	Row Count	AUC	Performance Decay	Passed
log_model_champion	Baseline (0.0)	train_dataset_final	2585	0.6827	0.0000	True
log_model_champion	Baseline (0.0)	test_dataset_final	647	0.6707	0.0000	True
log_model_champion	0.1	train_dataset_final	2585	0.6826	0.0002	True
log_model_champion	0.1	test_dataset_final	647	0.6685	0.0022	True
log_model_champion	0.2	train_dataset_final	2585	0.6795	0.0032	True
log_model_champion	0.2	test_dataset_final	647	0.6630	0.0077	True
log_model_champion	0.3	train_dataset_final	2585	0.6736	0.0091	True
log_model_champion	0.3	test_dataset_final	647	0.6737	-0.0030	True
log_model_champion	0.4	train_dataset_final	2585	0.6653	0.0174	True
log_model_champion	0.4	test_dataset_final	647	0.6541	0.0166	True
log_model_champion	0.5	train_dataset_final	2585	0.6700	0.0127	True
log_model_champion	0.5	test_dataset_final	647	0.6476	0.0231	True
rf_model	Baseline (0.0)	train_dataset_final	2585	1.0000	0.0000	True
rf_model	Baseline (0.0)	test_dataset_final	647	0.7762	0.0000	True
rf_model	0.1	train_dataset_final	2585	0.9837	0.0163	True
rf_model	0.1	test_dataset_final	647	0.7628	0.0134	True
rf_model	0.2	train_dataset_final	2585	0.9433	0.0567	False
rf_model	0.2	test_dataset_final	647	0.7753	0.0009	True
rf_model	0.3	train_dataset_final	2585	0.8945	0.1054	False
rf_model	0.3	test_dataset_final	647	0.7433	0.0329	True
rf_model	0.4	train_dataset_final	2585	0.8493	0.1507	False
rf_model	0.4	test_dataset_final	647	0.7243	0.0519	False
rf_model	0.5	train_dataset_final	2585	0.7883	0.2117	False
rf_model	0.5	test_dataset_final	647	0.6723	0.1039	False

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:3f2f

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression:3f29

2026-07-14 05:36:05,859 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression does not exist in model's document

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to inspect the differences between our champion and challenger to see whether one model offers more understandable or logical feature importance scores.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI

['validmind.model_validation.FeaturesAUC',
 'validmind.model_validation.sklearn.PermutationFeatureImportance',
 'validmind.model_validation.sklearn.SHAPGlobalImportance']

We'll only use our testing dataset (vm_test_ds) here, to provide a realistic, unseen sample that mimics future or production data, as the training dataset has already influenced our model during learning:

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        f"{test}:champion_vs_challenger",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_log_model, vm_rf_model],
        },
    ).log()

Features Champion Vs Challenger

The FeaturesAUC test evaluates the discriminatory power of each individual feature by calculating a univariate AUC against the binary target. The result is presented as a ranked bar chart for test_dataset_final, with feature-level AUC values spanning approximately 0.40 to 0.59. Geography_Germany and Balance appear at the top of the ranking with the highest AUC values, while NumOfProducts appears at the bottom with the lowest AUC. Most remaining features are clustered in the mid-range, roughly between 0.43 and 0.48.

Key insights:

Geography_Germany has the highest AUC: Geography_Germany shows the strongest standalone discrimination in the chart, with an AUC close to 0.59, making it the top-ranked individual feature in this test.
Balance is a close second: Balance follows closely behind Geography_Germany, with an AUC near 0.57, indicating comparatively strong univariate separation relative to the other features shown.
Most features are tightly clustered: EstimatedSalary, HasCrCard, Tenure, Geography_Spain, CreditScore, IsActiveMember, and Gender_Male fall within a relatively narrow AUC band of roughly 0.43 to 0.48, indicating limited spread across the middle of the ranking.
NumOfProducts is lowest-ranked: NumOfProducts has the smallest AUC in the figure, at approximately 0.40, placing it at the bottom of the univariate discrimination ranking.

Overall, the test shows moderate variation in standalone feature discrimination across the evaluated inputs. Two features, Geography_Germany and Balance, are visibly separated from the rest of the ranking at the upper end, while the majority of features form a compact middle group. The lower end of the distribution is represented by NumOfProducts, which has the weakest univariate AUC among the features displayed.

Figures

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:5169

ValidMind Figure validmind.model_validation.FeaturesAUC:champion_vs_challenger:d8b3

2026-07-14 05:36:22,616 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.FeaturesAUC:champion_vs_challenger does not exist in model's document
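
A univariate AUC treats each raw feature as if it were the model score and asks how well it alone ranks positives above negatives. The toy example below is a sketch with hypothetical data and a hand-written `auc` helper, not the FeaturesAUC implementation.

```python
def auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical binary target and three illustrative features
target = [0, 0, 0, 1, 1, 1, 0, 1]
features = {
    "rises_with_target": [1, 2, 3, 6, 7, 8, 4, 5],
    "falls_with_target": [8, 7, 6, 3, 2, 1, 5, 4],
    "constant": [1] * 8,
}

# Score each feature on its own, then rank from most to least discriminating
univariate = {name: auc(target, values) for name, values in features.items()}
ranked = sorted(univariate.items(), key=lambda item: item[1], reverse=True)

print(univariate)
# {'rises_with_target': 1.0, 'falls_with_target': 0.0, 'constant': 0.5}
```

Values near 0.5 carry little standalone ranking signal, while values well below 0.5 mean that higher feature values tend to accompany the negative class.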

Permutation Feature Importance Champion Vs Challenger

The Permutation Feature Importance test evaluates the significance of each input feature by measuring the change in model performance after that feature is randomly permuted. The result is presented as two feature-importance bar charts, one for the champion logistic model (log_model_champion) and one for the challenger random forest model (rf_model). The charts show the relative magnitude and direction of permutation importance for each feature, highlighting which variables contribute most to each model’s predictive performance and which have near-zero or negative contribution under this test.

Key insights:

Different leading features across models: The champion model is most influenced by Geography_Germany, IsActiveMember, and Gender_Male, while the challenger model is most influenced by NumOfProducts, Balance, and Geography_Germany. This indicates materially different feature reliance patterns between the two models.
Challenger importance is more concentrated: In the challenger model, NumOfProducts has the largest importance by a clear margin, followed by Balance and Geography_Germany, with the remaining features contributing only marginally. The champion model shows a comparatively more distributed profile across its top three features.
Champion assigns stronger importance to membership and gender indicators: IsActiveMember and Gender_Male are among the three largest importance values in the champion model, whereas both features are much smaller contributors in the challenger model. This reflects a substantially greater dependence on these predictors in the champion specification.
Negative importances appear in both models: Several features have negative permutation importance values. In the champion model these include Balance, HasCrCard, CreditScore, EstimatedSalary, and Geography_Spain; in the challenger model, HasCrCard is negative and CreditScore is approximately zero to slightly negative. These results indicate that permuting those features did not reduce performance in these runs.
Shared low-contribution features: Tenure is low-importance in both models, and Geography_Spain is near zero in the challenger model while slightly negative in the champion model. This indicates limited incremental contribution from these variables relative to the higher-ranked predictors.

The permutation importance results show that the champion and challenger models rely on different subsets of the input space. The challenger model is dominated by NumOfProducts and Balance, while the champion model places its greatest weight on Geography_Germany, IsActiveMember, and Gender_Male. Both models also contain features with near-zero or negative permutation importance, indicating that contribution is unevenly distributed across predictors rather than broadly shared.

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:bc58

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger:f2e2

2026-07-14 05:36:46,117 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.PermutationFeatureImportance:champion_vs_challenger does not exist in model's document
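
The permutation idea described in the result above can be sketched in a few lines: shuffle one feature at a time, rescore, and record the drop from the baseline. This is an illustration under assumed conditions, not the library's code; the two-feature dataset, the `model` stand-in that only reads `signal`, and the choice of five repeats are hypothetical.

```python
import random


def auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
n = 300

# Hypothetical data: "signal" drives the label, "noise" is unrelated to it
signal = [rng.gauss(0.0, 1.0) for _ in range(n)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if s + 0.5 * rng.gauss(0.0, 1.0) > 0 else 0 for s in signal]
X = {"signal": signal, "noise": noise}


def model(row):
    """Stand-in model that scores using only the "signal" feature."""
    return row["signal"]


def score(features):
    """AUC of the stand-in model on a column-oriented feature dictionary."""
    rows = [dict(zip(features, values)) for values in zip(*features.values())]
    return auc(ys, [model(row) for row in rows])


baseline = score(X)
importances = {}
for name in X:
    drops = []
    for _ in range(5):  # repeat to average out shuffle randomness
        shuffled = X[name][:]
        rng.shuffle(shuffled)
        drops.append(baseline - score({**X, name: shuffled}))
    importances[name] = sum(drops) / len(drops)

print(importances)
```

In this sketch the ignored feature scores exactly zero. With a fitted model, a weakly used feature can land slightly above or below zero in a given run because the shuffle is random, which is one common reason small negative importances appear.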

SHAP Global Importance Champion Vs Challenger

The SHAP Global Importance test evaluates global feature importance using SHAP values to explain model behavior and compare feature influence across models. The results include a normalized SHAP importance bar chart and a SHAP summary plot for log_model_champion, along with SHAP interaction-style plots for rf_model. For log_model_champion, the bar chart ranks features by normalized importance, while the summary plot shows the direction and dispersion of each feature’s SHAP contribution. For rf_model, the displayed SHAP plots are limited to Tenure and CreditScore, with point distributions centered near zero and extending to both positive and negative values.

Key insights:

Champion model is concentrated in few features: In log_model_champion, IsActiveMember has the highest normalized importance at 100, followed by Geography_Germany at roughly the low 80s and Gender_Male at roughly the mid 70s. Balance is the next largest contributor at about the mid 40s, while all remaining features are materially lower.
Lower-ranked features have limited global contribution: CreditScore, Tenure, NumOfProducts, HasCrCard, and EstimatedSalary all appear below roughly 20 on the normalized scale in log_model_champion, and Geography_Spain is the smallest contributor at only a few percentage points.
Binary features show directionally separated effects: In the log_model_champion summary plot, IsActiveMember, Geography_Germany, and Gender_Male each show two concentrated SHAP groupings on opposite sides of zero, consistent with distinct directional effects associated with their low and high feature values.
Balance has the widest continuous spread: Balance in log_model_champion shows a broad SHAP range extending from modestly negative values to strongly positive values, with higher feature values concentrated on the positive side of the axis.
CreditScore and Tenure show inverse color-direction patterns: In the log_model_champion summary plot, higher CreditScore values are concentrated more on the negative SHAP side while lower values appear more on the positive side. A similar pattern is visible for Tenure, where higher values are positioned more negatively and lower values more positively.
Random forest plots are narrow and feature-limited: The rf_model visuals shown are restricted to Tenure and CreditScore. In both displayed panels, the SHAP interaction values are tightly clustered around zero, with symmetric dispersion to either side and no single dominant direction evident from the plotted distributions.

The SHAP results show that log_model_champion relies most heavily on IsActiveMember, Geography_Germany, and Gender_Male, with a clear drop-off after the top four features. The summary plot indicates both directional separation for binary variables and broader continuous variation for Balance, while CreditScore and Tenure display predominantly inverse relationships between feature value and SHAP effect. The rf_model output shown here is narrower in scope, covering only Tenure and CreditScore, and the displayed SHAP interaction values remain centered close to zero without a pronounced directional pattern.

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:2562

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:f99d

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:7407

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger:3653

2026-07-14 05:37:00,240 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.SHAPGlobalImportance:champion_vs_challenger does not exist in model's document

In summary

In this third notebook, you learned how to:

Initialize ValidMind model objects
Assign predictions and probabilities to your ValidMind model objects
Use tests from ValidMind to evaluate the potential of models, including comparative tests between champion and challengers
Log an artifact in the ValidMind Platform

Next steps

Finalize validation and reporting

Now that you're familiar with the basics of using the ValidMind Library to run and log validation tests, let's learn how to implement some custom tests and wrap up our validation: 4 — Finalize validation and reporting