Run comparison tests

Learn how to use the ValidMind Library to run comparison tests that take any datasets or models as inputs. Identify comparison tests to run, initialize ValidMind dataset and model objects in preparation for passing them to tests, and then run tests — generating outputs automatically logged to your model's documentation in the ValidMind Platform.

We recommend that you first complete our introductory notebook on running tests.

Run dataset-based tests

About ValidMind

ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.

Before you begin

This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with pip install, and then re-run the notebook. For more help, refer to Installing Python Modules.

New to ValidMind?

If you haven't already seen our documentation on the ValidMind Library, we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.

For access to all features available in this notebook, you'll need access to a ValidMind account.

Register with ValidMind

Key concepts

Model documentation: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.

Documentation template: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.

Tests: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.

Metrics: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.

Custom metrics: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with the ValidMind Library to be used in the ValidMind Platform.

Inputs: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:

model: A single model that has been initialized in ValidMind with vm.init_model().
dataset: Single dataset that has been initialized in ValidMind with vm.init_dataset().
models: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.
datasets: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. (Learn more: Run tests with multiple datasets)

Parameters: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.

Outputs: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.

Test suites: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.

Example: the classifier_full_suite test suite runs tests from the tabular_dataset and classifier test suites to fully document the data and model sections for binary classification model use-cases.

Setting up

Install the ValidMind Library

Recommended Python versions

Python 3.8 <= x <= 3.14

To install the library:

%pip install -q validmind

Initialize the ValidMind Library

Register sample model

Let's first register a sample model for use with this notebook.

In a browser, log in to ValidMind.
In the left sidebar, navigate to Inventory and click + Register Model.
Enter the model details and click Next > to continue to assignment of model stakeholders. (Need more help?)
Select your own name under the MODEL OWNER drop-down.
Click Register Model to add the model to your inventory.

Apply documentation template

Once you've registered your model, let's select a documentation template. A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.

In the left sidebar that appears for your model, click Documents and select Development.
Under TEMPLATE, select Binary classification.
Click Use Template to apply the template.

Get your code snippet

Initialize the ValidMind Library with the code snippet unique to each model per document, ensuring your test results are uploaded to the correct model and automatically populated in the right document in the ValidMind Platform when you run this notebook.

On the left sidebar that appears for your model, select Getting Started and select Development from the DOCUMENT drop-down menu.
Click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet::

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="documentation",
)

Preview the documentation template

Let's verify that you have connected the ValidMind Library to the ValidMind Platform and that the appropriate template is selected for your model.

You will upload documentation and test results unique to your model based on this template later on. For now, take a look at the default structure that the template provides with the vm.preview_template() function from the ValidMind library and note the empty sections:

vm.preview_template()

Initialize the Python environment

Next, let's import the necessary libraries and set up your Python environment for data analysis:

import xgboost as xgb

%matplotlib inline

Explore a ValidMind test

Before we run a test, use the vm.tests.list_tests() function to return information on out-of-the-box tests available in the ValidMind Library.

Let's assume you want to evaluate classifier performance for a model. Classifier performance measures how well a classification model correctly predicts outcomes, using metrics like precision, recall, and F1 score.

We'll pass in a filter to the list_tests function to find the test ID for classifier performance:

vm.tests.list_tests(filter="ClassifierPerformance")

We've identified from the output that the test ID for the classifier performance test is validmind.model_validation.ClassifierPerformance.

Use this ID combined with the describe_test() function to retrieve more information about the test, including its Required Inputs:

test_id = "validmind.model_validation.sklearn.ClassifierPerformance"
vm.tests.describe_test(test_id)

Since this test requires a dataset and a model, you can expect it to throw an error when we run it without passing in either as input:

try:
    vm.tests.run_test(test_id)
except Exception as e:
    print(e)

Learn more about the individual tests available in the ValidMind Library

Check out our Explore tests notebook for more code examples and usage of key functions.

Working with ValidMind datasets

Import the sample dataset

Since we need a dataset to run tests, let's import the public Bank Customer Churn Prediction dataset from Kaggle so that we have something to work with.

In our below example, note that:

The target column, Exited has a value of 1 when a customer has churned and 0 otherwise.
The ValidMind Library provides a wrapper to automatically load the dataset as a Pandas DataFrame object. A Pandas Dataframe is a two-dimensional tabular data structure that makes use of rows and columns.

# Import the sample dataset from the library

from validmind.datasets.classification import customer_churn

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{customer_churn.target_column}' \n\t• Class labels: {customer_churn.class_labels}"
)

raw_df = customer_churn.load_data()
raw_df.head()

Split the dataset

Let's first split our dataset to help assess how well the model generalizes to unseen data.

Use preprocess() to split our dataset into three subsets:

train_df — Used to train the model.
validation_df — Used to evaluate the model's performance during training.
test_df — Used later on to asses the model's performance on new, unseen data.

train_df, validation_df, test_df = customer_churn.preprocess(raw_df)

Initialize the ValidMind dataset

The next step is to connect your data with a ValidMind Dataset object. This step is always necessary every time you want to connect a dataset to documentation and produce test results through ValidMind, but you only need to do it once per dataset.

ValidMind dataset objects provide a wrapper to any type of dataset (NumPy, Pandas, Polars, etc.) so that tests can run transparently regardless of the underlying library.

Initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module. For this example, we'll pass in the following arguments:

dataset — The raw dataset that you want to provide as input to tests.
input_id — A unique identifier that allows tracking what inputs are used when running each individual test.
target_column — A required argument if tests require access to true values. This is the name of the target column in the dataset.

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=customer_churn.target_column,
)

vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset",
    target_column=customer_churn.target_column,
)

Working with ValidMind models

Train a sample model

To train the model, we need to provide it with:

Inputs — Features such as customer age, usage, etc.
Outputs (Expected answers/labels) — in our case, we would like to know whether the customer churned or not.

Here, we'll use x_train and x_val to hold the input data (features), and y_train and y_val to hold the answers (the target we want to predict):

x_train = train_df.drop(customer_churn.target_column, axis=1)
y_train = train_df[customer_churn.target_column]
x_val = validation_df.drop(customer_churn.target_column, axis=1)
y_val = validation_df[customer_churn.target_column]

Next, let's create an XGBoost classifier model that will automatically stop training if it doesn't improve after 10 tries. XGBoost is a gradient-boosted tree ensemble that builds trees sequentially, with each tree correcting the errors of the previous ones — typically known for strong predictive performance and built-in regularization to reduce overfitting.

Setting an explicit threshold avoids wasting time and helps prevent further overfitting by stopping training when further improvement isn't happening. We'll also set three evaluation metrics to get a more complete picture of model performance:

error — Measures how often the model makes incorrect predictions.
logloss — Indicates how confident the predictions are.
auc — Evaluates how well the model distinguishes between churn and not churn.

model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)

Finally, our actual training step — where the model learns patterns from the data, so it can make predictions later:

The model is trained on x_train and y_train, and evaluates its performance using x_val and y_val to check if it’s learning well.
To turn off printed output while training, we'll set verbose to False.

model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)

Initialize the ValidMind model

You'll also need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data for our model.

You simply initialize this model object with vm.init_model():

vm_model_xgb = vm.init_model(
    model,
    input_id="xgboost",
)

Assign predictions

Once the model has been registered, you can assign model predictions to the training and testing datasets.

The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

vm_train_ds.assign_predictions(model=vm_model_xgb)
vm_test_ds.assign_predictions(model=vm_model_xgb)

Running ValidMind tests

Now that we know how to initialize ValidMind dataset and model objects, we're ready to run some tests!

You run individual tests by calling the run_test function provided by the validmind.tests module. For the examples below, we'll pass in the following arguments:

test_id — The ID of the test to run, as seen in the ID column when you run list_tests.
inputs — A dictionary of test inputs, such as dataset, model, datasets, or models. These are ValidMind objects initialized with vm.init_dataset() or vm.init_model().

Run classifier performance test with one model

Run validmind.data_validation.ClassifierPerformance test with the testing dataset (vm_test_ds) and model (vm_model_xgb) as inputs:

result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model_xgb,
    },
)

Run comparison tests

To evaluate which models might be a better fit for a use case based on their performance on selected criteria, we can run the same test with multiple models. We'll train three additional models and run the classifier performance test with for all four models using a single run_test() call.

ValidMind helps streamline your documentation and testing.

You could call run_test() multiple times passing in different inputs, but you can also pass an input_grid object — a dictionary of test input keys and values that allow you to run a single test for a combination of models and datasets.

With input_grid, run comparison tests for multiple datasets, or even multiple datasets and models simultaneously — input_grid can be used with run_test() for all possible combinations of inputs, generating a cohesive and comprehensive single output.

Random forest classifier models use an ensemble method that builds multiple decision trees and averages their predictions. Random forest is robust to overfitting and handles non-linear relations well, but is typically less interpretable than simpler models:

from sklearn.ensemble import RandomForestClassifier

# Train the random forest classifer model
model_rf = RandomForestClassifier()
model_rf.fit(x_train, y_train)

# Initialize the ValidMind model object for the random forest classifer model
vm_model_rf = vm.init_model(
    model_rf,
    input_id="random_forest",
)

# Assign predictions to the test dataset for the random forest classifer model
vm_test_ds.assign_predictions(model=vm_model_rf)

Logistic regression models are linear models that estimate class probabilities via a logistic (sigmoid) function. Logistic regression is highly interpretable with fast training, establishing a strong baseline — however, they struggle when relationships are non-linear as real-world relationships often are:

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Scaling features ensures the lbfgs solver converges reliably
model_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression()),
])
model_lr.fit(x_train, y_train)

# Initialize the ValidMind model object for the logistic regression model
vm_model_lr = vm.init_model(
    model_lr,
    input_id="logistic_regression",
)

# Assign predictions to the test dataset for the logistic regression model
vm_test_ds.assign_predictions(model=vm_model_lr)

Decision tree classifier models are a single tree with data split on feature thresholds. Useful as an explanability benchmark, decision trees are easy to visualize and interpret — but are prone to overfitting without pruning or ensemble techniques:

from sklearn.tree import DecisionTreeClassifier

# Train the decision tree classifer model
model_dt = DecisionTreeClassifier()
model_dt.fit(x_train, y_train)

# Initialize the ValidMind model object for the decision tree classifier model
vm_model_dt = vm.init_model(
    model_dt,
    input_id="decision_tree",
)

# Assign predictions to the test dataset for the decision tree classifiermodel
vm_test_ds.assign_predictions(model=vm_model_dt)

Run classifier performance test with multiple models

Now, we'll use the input_grid to run the ClassifierPerformance test on all four models using the testing dataset (vm_test_ds).

When running individual tests, you can use a custom result_id to tag the individual result with a unique identifier by appending this result_id to the test_id with a : separator. We'll append an identifier to signify that this test was run on all_models to differentiate this test run from other runs:

perf_comparison_result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance:all_models",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_model_xgb, vm_model_rf, vm_model_lr, vm_model_dt],
    },
)

Our output indicates that the XGBoost and random forest classification models provide the strongest overall classification performance, so we'll continue our testing with those two models as input only.

Run classifier performance test with multiple parameter values

Next, let's run the classifier performance test with the param_grid object, which runs the same test multiple times with different parameter values. We'll append an identifier to signify that this test was run with our parameter_grid configuration:

parameter_comparison_result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance:parameter_grid",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_model_xgb,vm_model_rf]
    },
    param_grid={
        "average": ["macro", "micro"]
    },
)

Run comparison test with multiple datasets

Let's also run the ROCCurve test using input_grid to iterate through multiple datasets, which plots the ROC curves for the training (vm_train_ds) and test (vm_test_ds) datasets side by side — a common scenario when you want to compare the performance of a model on the training and test datasets and visually assess how much performance is lost in the test dataset.

We'll also need to assign predictions to the training dataset for the random forest classifier model, since we didn't do that in our earlier setup:

vm_train_ds.assign_predictions(model=vm_model_rf)

We'll append an identifier to signify that this test was run with our train_vs_test dataset comparison configuration:

roc_curve_result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ROCCurve:train_vs_test",
    input_grid={
        "dataset": [vm_train_ds, vm_test_ds],
        "model": [vm_model_xgb,vm_model_rf],
    },
)

Work with test results

Every test result returned by the run_test() function has a .log() method that can be used to send the test results to the ValidMind Platform. When logging individual test results to the platform, you'll need to manually add those results to the desired section of the model documentation.

You can do this through the ValidMind Platform interface after logging your test results (Learn more ...), or directly via the ValidMind Library when calling .log() by providing an optional section_id. The section_id should be a string that matches the title of a section in the documentation template in snake_case.

Let's log the results of the classifier performance test (perf_comparison_result) and the ROCCurve (roc_curve_result) test in the model_evaluation section of the documentation — present in the template we previewed in the beginning of this notebook:

perf_comparison_result.log(section_id="model_evaluation")
roc_curve_result.log(section_id="model_evaluation")

Finally, let's head to the model we connected to at the beginning of this notebook and view our inserted test results in the updated documentation (Need more help?):

From the Inventory in the ValidMind Platform, go to the model you connected to earlier.
In the left sidebar that appears for your model, click Development under Documents.
Expand the 3.2. Model Evaluation section.
Confirm that perf_comparison_result and roc_curve_result display in this section as expected.

Next steps

Now that you know how to run comparison tests with the ValidMind Library, you’re ready to take the next step. Extend the functionality of run_test() with your own custom test functions that can be incorporated into documentation templates just like any default out-of-the-box ValidMind test.

Learn how to implement custom tests with the ValidMind Library.

Check out our Implement comparison tests notebook for code examples and usage of key functions.

Discover more learning resources

We offer many interactive notebooks to help you automate testing, documenting, validating, and more:

Or, visit our documentation to learn more about ValidMind.

Upgrade ValidMind

After installing ValidMind, you’ll want to periodically make sure you are on the latest version to access any new features and other enhancements.

Retrieve the information for the currently installed version of ValidMind:

%pip show validmind

If the version returned is lower than the version indicated in our production open-source code, restart your notebook and run:

%pip install --upgrade validmind

You may need to restart your kernel after running the upgrade package for changes to be applied.