
ValidMind for model development 3 — Integrate custom tests

Learn how to use ValidMind for your end-to-end model documentation process with our series of four introductory notebooks. In this third notebook, supplement ValidMind tests with your own and include them as additional evidence in your documentation.

This notebook assumes that you already have a repository of custom-made tests considered critical to include in your documentation. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs.
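
For example, a bare-bones custom test could be an ordinary Python function like the following hypothetical sketch, which takes a pandas DataFrame as its input, a threshold as its parameter, and returns a table (the function name and logic are illustrative only, not a test used later in this notebook):

import pandas as pd

def missing_rate_table(df: pd.DataFrame, threshold: float = 0.1) -> pd.DataFrame:
    """Reports the share of missing values per column and flags columns above the threshold."""
    rates = df.isna().mean().rename("missing_rate").to_frame()
    rates["above_threshold"] = rates["missing_rate"] > threshold
    return rates  # a table is one valid kind of test output

Later in this notebook you'll see how the @vm.test decorator turns a function like this into a test that the ValidMind Library can run and log.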

For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.

Learn by doing

Our course tailor-made for developers new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Developer Fundamentals

Prerequisites

To integrate custom tests into your model documentation with this notebook, you'll first need to have:

Need help with the above steps?

Refer to the first two notebooks in this series:

Setting up

This section should be quite familiar to you, as we performed the same actions in the previous notebook, 2 — Start the model development process.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Inventory and select the model you registered for this "ValidMind for model development" series of notebooks.

  3. Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Note: you may need to restart the kernel to use updated packages.
2025-05-22 18:09:09,497 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model development (ID: cmalgf3qi02ce199qm3rdkl46)
📁 Document Type: model_documentation

Import sample dataset

Next, we'll import the same public Bank Customer Churn Prediction dataset from Kaggle we used in the last notebook so that we have something to work with:

from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

We'll apply a simple rebalancing technique to the dataset before continuing:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

Remove highly correlated features

Let's also quickly remove highly correlated features from the dataset using the output from a ValidMind test.

As you learned previously, before you can run tests you'll need to initialize a ValidMind dataset object:

# Register the new balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

With our balanced dataset initialized, we can then run our test and utilize the output to help us identify the features we want to remove:

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

High Pearson Correlation ❌

High Pearson Correlation is designed to measure the linear relationship between features in a dataset, with the main goal of identifying high correlations that might indicate feature redundancy or multicollinearity. Identification of such issues allows developers and risk management teams to properly deal with potential impacts on the machine learning model's performance and interpretability.

The test operates by generating pairwise Pearson correlations for all features in the dataset, then sorting and eliminating duplicate and self-correlations. It assigns a Pass or Fail based on whether the absolute value of the correlation coefficient surpasses a pre-set threshold, which is defaulted at 0.3. The Pearson correlation coefficient quantifies the degree of linear relationship between two variables, ranging from -1 to 1, where values closer to 1 or -1 indicate a strong linear relationship, and values near 0 suggest no linear relationship. The test returns the top n strongest correlations, regardless of passing or failing status, where n is 10 by default but can be configured by passing the top_n_correlations parameter.

The primary advantages of this test include providing a quick and simple means of identifying relationships between feature pairs. It generates a transparent output that displays pairs of correlated variables, the Pearson correlation coefficient, and a Pass or Fail status for each. This aids in the early identification of potential multicollinearity issues that may disrupt model training, allowing for timely intervention to maintain model performance and interpretability.

It should be noted that the test can only delineate linear relationships, failing to shed light on nonlinear relationships or dependencies. It is sensitive to outliers, where a few outliers could notably affect the correlation coefficient. Additionally, it is limited to identifying redundancy only within feature pairs and may fail to spot more complex relationships among three or more variables, which could also impact model performance.

This test shows the results in a table format, where each row represents a pair of features, the Pearson correlation coefficient between them, and a Pass or Fail status based on the pre-set threshold of 0.3. The table includes columns for the feature pairs, the calculated correlation coefficient, and the Pass/Fail status. The coefficients range from -0.1906 to 0.3757, indicating varying degrees of linear relationships between the feature pairs. The notable observation is that only one pair, (Age, Exited), fails the test with a coefficient of 0.3757, suggesting a potential issue with multicollinearity or feature redundancy. The other pairs pass the test, indicating that their correlations do not exceed the threshold and are less likely to pose a risk of multicollinearity.

The test results reveal the following key insights:

  • Age and Exited Correlation: The pair (Age, Exited) shows a correlation coefficient of 0.3757, which exceeds the threshold, indicating a potential multicollinearity issue.
  • Balance and NumOfProducts Correlation: The pair (Balance, NumOfProducts) has a correlation coefficient of -0.1906, which passes the test, suggesting a weaker linear relationship.
  • IsActiveMember and Exited Correlation: The pair (IsActiveMember, Exited) shows a correlation coefficient of -0.185, passing the test and indicating a weak linear relationship.
  • Balance and Exited Correlation: The pair (Balance, Exited) has a correlation coefficient of 0.1693, passing the test and suggesting a weak linear relationship.
  • NumOfProducts and Exited Correlation: The pair (NumOfProducts, Exited) shows a correlation coefficient of -0.0698, passing the test and indicating a very weak linear relationship.

Based on these results, the test identifies that the feature pair (Age, Exited) exhibits a correlation coefficient that surpasses the threshold, suggesting a potential issue with multicollinearity that could affect model performance and interpretability. The other feature pairs show weaker correlations, passing the test and indicating that they are less likely to contribute to multicollinearity. This suggests that while most feature pairs do not pose a significant risk, the relationship between Age and Exited should be further examined to ensure it does not negatively impact the model's predictive capabilities. The insights gained from this test can guide further analysis and potential feature engineering to address any identified issues.

Test Parameters

{
  "max_threshold": 0.3
}

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3757 Fail
(Balance, NumOfProducts) -0.1906 Pass
(IsActiveMember, Exited) -0.1850 Pass
(Balance, Exited) 0.1693 Pass
(NumOfProducts, Exited) -0.0698 Pass
(Age, Balance) 0.0589 Pass
(Age, NumOfProducts) -0.0498 Pass
(Tenure, EstimatedSalary) 0.0466 Pass
(NumOfProducts, IsActiveMember) 0.0434 Pass
(Balance, HasCrCard) -0.0430 Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3757 Fail
1 (Balance, NumOfProducts) -0.1906 Pass
2 (IsActiveMember, Exited) -0.1850 Pass
3 (Balance, Exited) 0.1693 Pass
4 (NumOfProducts, Exited) -0.0698 Pass
5 (Age, Balance) 0.0589 Pass
6 (Age, NumOfProducts) -0.0498 Pass
7 (Tenure, EstimatedSalary) 0.0466 Pass
8 (NumOfProducts, IsActiveMember) 0.0434 Pass
9 (Balance, HasCrCard) -0.0430 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']

We can then re-initialize the dataset with a different input_id and the highly correlated features removed and re-run the test for confirmation:

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

High Pearson Correlation ✅

High Pearson Correlation is designed to measure the linear relationship between features in a dataset, with the main goal of identifying high correlations that might indicate feature redundancy or multicollinearity. Identification of such issues allows developers and risk management teams to properly deal with potential impacts on the machine learning model's performance and interpretability.

The test operates by generating pairwise Pearson correlations for all features in the dataset, then sorting and eliminating duplicate and self-correlations. The Pearson correlation coefficient quantifies the degree of linear relationship between two variables, ranging from -1 to 1, where values close to 1 or -1 indicate a strong linear relationship, and values near 0 suggest no linear relationship. The test assigns a Pass or Fail based on whether the absolute value of the correlation coefficient surpasses a pre-set threshold, defaulted at 0.3. It returns the top n strongest correlations, where n is configurable, to highlight potential multicollinearity or redundancy.

The primary advantages of this test include providing a quick and simple means of identifying relationships between feature pairs. It generates a transparent output that displays pairs of correlated variables, the Pearson correlation coefficient, and a Pass or Fail status for each. This aids in the early identification of potential multicollinearity issues that may disrupt model training, allowing for timely intervention to maintain model performance and interpretability.

It should be noted that the test can only delineate linear relationships, failing to shed light on nonlinear relationships or dependencies. It is sensitive to outliers, where a few outliers could notably affect the correlation coefficient. Additionally, it is limited to identifying redundancy only within feature pairs and may fail to spot more complex relationships among three or more variables, which could also impact model performance.

This test shows the results in a table format, listing feature pairs, their Pearson correlation coefficients, and a Pass or Fail status. Each row represents a pair of features, with the correlation coefficient indicating the strength and direction of their linear relationship. The coefficients range from -0.1906 to 0.1693, all of which are below the threshold of 0.3, resulting in a Pass status for each pair. The table provides a clear view of the linear relationships between features, with negative values indicating inverse relationships and positive values indicating direct relationships. Notable observations include the highest correlation between Balance and NumOfProducts at -0.1906 and the lowest between Tenure and HasCrCard at 0.0302, both of which are considered weak correlations.

The test results reveal the following key insights:

  • Weak Correlations Across Features: All feature pairs exhibit weak correlations, with coefficients ranging from -0.1906 to 0.1693, indicating no strong linear relationships.
  • Inverse Relationships Identified: Several feature pairs, such as Balance and NumOfProducts, show inverse relationships, though these are weak and unlikely to impact model performance significantly.
  • Pass Status for All Pairs: Every feature pair passes the test, as none exceed the threshold of 0.3, suggesting minimal risk of multicollinearity or redundancy.

Based on these results, the dataset does not exhibit any strong linear relationships between feature pairs, as all correlations are weak and fall below the threshold of 0.3. This suggests a low risk of multicollinearity affecting the model's performance or interpretability. The weak inverse relationships observed, such as between Balance and NumOfProducts, are unlikely to pose significant issues. Overall, the test indicates that the features are relatively independent in terms of linear relationships, which is favorable for model training and interpretation.

Test Parameters

{
  "max_threshold": 0.3
}

Tables

Columns Coefficient Pass/Fail
(Balance, NumOfProducts) -0.1906 Pass
(IsActiveMember, Exited) -0.1850 Pass
(Balance, Exited) 0.1693 Pass
(NumOfProducts, Exited) -0.0698 Pass
(Tenure, EstimatedSalary) 0.0466 Pass
(NumOfProducts, IsActiveMember) 0.0434 Pass
(Balance, HasCrCard) -0.0430 Pass
(Tenure, IsActiveMember) -0.0360 Pass
(CreditScore, Exited) -0.0344 Pass
(Tenure, HasCrCard) 0.0302 Pass

Train the model

We'll then train a simple logistic regression model on our prepared dataset:

# First encode the categorical features in our dataset with the highly correlated features removed
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
2327 632 6 111589.33 1 1 1 170382.99 0 False False True
2942 670 6 158719.57 1 1 1 118607.40 0 False True True
3314 627 7 147361.57 1 1 1 133031.96 0 True False False
6423 684 4 207034.96 2 0 0 157694.76 1 False True True
2359 511 5 68375.27 1 1 0 193160.25 1 False False True
# Split the processed dataset into train and test
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
from sklearn.linear_model import LogisticRegression

# Logistic Regression grid params
log_reg_params = {
    "penalty": ["l1", "l2"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "solver": ["liblinear"],
}

# Grid search for Logistic Regression
from sklearn.model_selection import GridSearchCV

grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params)
grid_log_reg.fit(X_train, y_train)

# Logistic Regression best estimator
log_reg = grid_log_reg.best_estimator_

Initialize the ValidMind objects

Let's initialize the ValidMind Dataset and Model objects in preparation for assigning model predictions to each dataset:

# Initialize the datasets into their own dataset objects
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

# Initialize a model object
vm_model = vm.init_model(log_reg, input_id="log_reg_model_v1")

Assign predictions

Once the model is registered, we'll assign predictions to the training and test datasets:

vm_train_ds.assign_predictions(model=vm_model)
vm_test_ds.assign_predictions(model=vm_model)
2025-05-22 18:09:34,940 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2025-05-22 18:09:34,942 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2025-05-22 18:09:34,942 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2025-05-22 18:09:34,945 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2025-05-22 18:09:34,947 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2025-05-22 18:09:34,948 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2025-05-22 18:09:34,949 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2025-05-22 18:09:34,950 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Implementing a custom inline test

With the setup out of the way, let's implement a custom inline test that calculates the confusion matrix for a binary classification model.

  • An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
  • You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.

Create a confusion matrix plot

Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:

import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()

Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:

  • The function confusion_matrix takes two arguments, dataset and model. These are a VMDataset and a VMModel object, respectively.
    • VMDataset objects allow you to access the dataset's true (target) values by accessing the .y attribute.
    • VMDataset objects allow you to access the predictions for a given model by accessing the .y_pred() method.
  • The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
  • The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
  • The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
  • The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

You can now run the newly created custom test on both the training and test datasets using the run_test() function:

# Training dataset
result = vm.tests.run_test(
    "my_custom_tests.ConfusionMatrix:training_dataset",
    inputs={"model": vm_model, "dataset": vm_train_ds},
)

Confusion Matrix Training Dataset

Confusion Matrix: Training Dataset is designed to evaluate the performance of a classification model by comparing predicted and actual outcomes.

The test operates by constructing a 2x2 table that includes True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These values help calculate key metrics such as accuracy, precision, recall, and F1 score. Accuracy measures the overall correctness, precision indicates the proportion of true positive results in all positive predictions, recall shows the ability to identify all positive instances, and F1 score balances precision and recall. These metrics provide a comprehensive view of the model's performance, with values typically ranging from 0 to 1, where higher values indicate better performance.

The primary advantages of this test include its ability to provide a detailed breakdown of model performance across different prediction outcomes. It allows for the identification of specific areas where the model excels or needs improvement, such as distinguishing between false positives and false negatives. This granularity is particularly useful for models where the cost of different types of errors varies, enabling targeted improvements and better decision-making.

It should be noted that the confusion matrix does not account for the probability of predictions, which can be a limitation in models where confidence levels are important. Additionally, it is sensitive to class imbalance, where one class may dominate the predictions, skewing the results. This can lead to misleading interpretations if not considered alongside other metrics or techniques that address imbalance.

This test shows a visual representation of the confusion matrix with values: TP = 816, TN = 853, FP = 443, and FN = 413. The matrix is color-coded, with a gradient indicating the frequency of each outcome. The x-axis represents predicted labels, while the y-axis shows true labels. The color intensity reflects the count, with brighter colors indicating higher values. The matrix provides a clear view of the model's performance, highlighting areas of strength and potential improvement.

The test results reveal the following key insights:

  • High True Negative Rate: The model correctly identifies 853 negative instances, indicating strong performance in recognizing true negatives.
  • Balanced True Positive Rate: With 816 true positives, the model shows a balanced ability to correctly predict positive instances.
  • Significant False Positives: There are 443 false positives, suggesting room for improvement in reducing incorrect positive predictions.
  • Moderate False Negatives: The presence of 413 false negatives indicates a need to enhance the model's sensitivity to positive instances.

Based on these results, the model demonstrates a solid ability to correctly classify both positive and negative instances, with a relatively balanced performance across true positives and true negatives. However, the presence of a notable number of false positives and false negatives suggests areas for potential refinement. The insights from the confusion matrix can guide targeted adjustments to improve precision and recall, ultimately enhancing the model's overall effectiveness in classification tasks.

Figures

# Test dataset
result = vm.tests.run_test(
    "my_custom_tests.ConfusionMatrix:test_dataset",
    inputs={"model": vm_model, "dataset": vm_test_ds},
)

Confusion Matrix Test Dataset

Confusion Matrix Test Dataset is designed to evaluate the performance of a classification model by comparing predicted and true labels.

The test operates by constructing a 2x2 table that includes True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These values help calculate metrics like accuracy, precision, recall, and F1 score. Accuracy measures the overall correctness, precision indicates the proportion of true positive results in all positive predictions, recall shows the ability to find all relevant cases, and F1 score balances precision and recall. These metrics provide a comprehensive view of the model's performance, with values typically ranging from 0 to 1, where higher values indicate better performance.

The primary advantages of this test include its ability to provide a detailed breakdown of model performance across different prediction outcomes. It allows for the identification of specific areas where the model excels or needs improvement, such as distinguishing between false positives and false negatives. This granularity is particularly useful for models deployed in critical applications where understanding the nature of errors is crucial.

It should be noted that the confusion matrix can be limited by its dependence on the threshold used for classification, which can affect the balance between precision and recall. Additionally, it does not account for the cost of different types of errors, which can vary significantly depending on the application. Interpretation can also be challenging in imbalanced datasets where one class dominates, potentially skewing the perceived performance.

This test shows a confusion matrix plot with predicted labels on the x-axis and true labels on the y-axis. The matrix displays four key values: 205 True Negatives, 115 False Positives, 116 False Negatives, and 201 True Positives. The color gradient indicates the frequency of each outcome, with yellow representing higher values. This visualization helps in quickly assessing the model's performance, with the diagonal elements (TP and TN) indicating correct predictions and off-diagonal elements (FP and FN) indicating errors.

The test results reveal the following key insights:

  • High True Negative Rate: The model correctly identifies 205 negative cases, indicating strong performance in recognizing true negatives.
  • Balanced True Positive Rate: With 201 true positives, the model shows a balanced ability to correctly predict positive cases.
  • Moderate False Positive Rate: The presence of 115 false positives suggests some over-prediction of positive cases.
  • Comparable False Negative Rate: The model has 116 false negatives, indicating a similar level of under-prediction for positive cases.

Based on these results, the model demonstrates a balanced performance with a slight tendency towards false positives and false negatives. The high true negative and true positive rates suggest that the model is generally effective in classification, but there is room for improvement in reducing errors. The insights from the confusion matrix can guide adjustments in model thresholds or training data to enhance precision and recall, particularly in applications where the cost of false predictions is significant.

Figures
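
Before moving on, note that custom tests aren't limited to returning figures. Since the ValidMind Library accepts either a plot or a table as a custom test output, a variation of the test above could return a summary table instead. The sketch below is illustrative only: the test ID my_custom_tests.ClassificationCounts is hypothetical, and it assumes a plain pandas DataFrame is accepted as a table output.

@vm.test("my_custom_tests.ClassificationCounts")
def classification_counts(dataset, model):
    """Summarizes true/false positive and negative counts for a binary classification model."""
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

    # Returning a DataFrame surfaces the result as a table rather than a figure
    return pd.DataFrame(
        {
            "Outcome": ["True Negative", "False Positive", "False Negative", "True Positive"],
            "Count": [tn, fp, fn, tp],
        }
    )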

Add parameters to custom tests

Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

Pass parameters to custom tests

You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.

  • The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
  • Because VMDataset and VMModel inputs have a special meaning, the ValidMind Library expects any dataset, model, datasets, or models argument declared in a custom test function to be passed as an input to run_test() or run_documentation_tests().

Re-running the confusion matrix with normalize=True on our test dataset looks like this:

# Test dataset with normalize=True
result = vm.tests.run_test(
    "my_custom_tests.ConfusionMatrix:test_dataset_normalized",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    params={"normalize": True}
)

Confusion Matrix Test Dataset Normalized

Confusion Matrix is designed to evaluate the performance of a classification model by comparing predicted and true labels.

The test operates by constructing a 2x2 table that includes True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These values help calculate key metrics such as accuracy, precision, recall, and F1 score. Accuracy measures the overall correctness, precision indicates the proportion of true positive results in all positive predictions, recall shows the ability to identify all positive instances, and F1 score balances precision and recall. These metrics provide a comprehensive view of the model's performance, with values typically ranging from 0 to 1, where higher values indicate better performance.

The primary advantages of this test include its ability to provide a detailed breakdown of model performance across different prediction outcomes. It allows for the identification of specific areas where the model excels or needs improvement, such as distinguishing between false positives and false negatives. This granularity is particularly useful for models deployed in critical applications where understanding specific error types is crucial. Additionally, the confusion matrix supports the calculation of multiple performance metrics, offering a holistic view of the model's effectiveness.

It should be noted that the confusion matrix has limitations, such as not being suitable for multi-class classification without extension. It also does not account for the cost of different types of errors, which can be significant in certain applications. The matrix's interpretation can be challenging if the dataset is imbalanced, as accuracy may be misleading. Furthermore, the confusion matrix does not provide insights into the model's performance on unseen data, which requires additional validation techniques.

This test shows a normalized confusion matrix plot, where each cell represents the proportion of predictions. The matrix is divided into four sections: True Positive (0.31), True Negative (0.32), False Positive (0.18), and False Negative (0.19). The color gradient indicates the proportion, with yellow representing higher values and purple lower values. The axes are labeled with predicted and true labels, providing a clear visual representation of the model's performance. The normalization allows for easy comparison of the model's ability to predict true and false labels, regardless of the dataset size.

The test results reveal the following key insights:

  • Balanced True Predictions: The model shows a relatively balanced ability to predict true positives (0.31) and true negatives (0.32).
  • Moderate False Predictions: The false positive rate (0.18) and false negative rate (0.19) are moderate, indicating areas for potential improvement.
  • Overall Performance: The model demonstrates a reasonable balance between precision and recall, as indicated by the similar values across the matrix.

Based on these results, the model exhibits a balanced performance in predicting both true positives and true negatives, with moderate rates of false predictions. This suggests that while the model is generally effective, there is room for improvement in reducing false positives and false negatives. The normalized values provide a clear understanding of the model's strengths and weaknesses, guiding further refinement and optimization efforts. The insights gained from this test are crucial for enhancing the model's predictive accuracy and reliability in practical applications.

Test Parameters

{
  "normalize": true
}

Figures

Log the confusion matrix results

As we learned in 2 — Start the model development process under Documenting results > Run and log an individual test, you can log any result to the ValidMind Platform with the .log() method of the result object, allowing you to then add the result to the documentation.

You can now do the same for the confusion matrix results:

result.log()
2025-05-22 18:10:36,476 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_dataset_normalized does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for this particular test ID.

That's expected, as when we run individual tests the results logged need to be manually added to your documentation within the ValidMind Platform.

Using external test providers

Creating inline custom tests with a function is a great way to customize your model documentation. However, sometimes you may want to reuse the same set of tests across multiple models and share them with others in your organization. In this case, you can create an external custom test provider that will allow you to load custom tests from a local folder or a Git repository.

In this section, you will learn how to declare a local filesystem test provider that allows loading tests from a local folder, following these high-level steps:

  1. Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
  2. Save an inline test to a file
  3. Define and register a LocalTestProvider that points to that folder
  4. Run test provider tests
  5. Add the test results to your documentation

Create custom tests folder

Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.

The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:

tests_folder = "my_tests"

import os

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    # remove files and pycache
    if f.endswith(".py") or f == "__pycache__":
        os.system(f"rm -rf {tests_folder}/{f}")

After running the command above, confirm that a new my_tests directory was created successfully. For example:

~/notebooks/tutorials/model_development/my_tests/
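
If you'd rather confirm this from within the notebook instead of your file browser, a quick check might look like this (a minimal sketch):

import os

print(os.path.abspath(tests_folder))  # full path to the new folder
print(os.listdir(tests_folder))       # empty until a test is saved in the next step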

Save an inline test

The @vm.test decorator we used in Implementing a custom inline test above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.

While save() will get you started by creating the file and saving the function code with the correct name, it won't automatically include any imports, or other functions or variables defined outside the function, that are needed for the test to run. To solve this, pass in an optional imports argument to ensure the necessary imports are added to the file.

The confusion_matrix test requires the following additional imports:

import matplotlib.pyplot as plt
from sklearn import metrics

Let's pass these imports to the save() method to ensure they are included in the file with the following command:

confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2025-05-22 18:10:37,058 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_development/my_tests/ConfusionMatrix.py!Be sure to add any necessary imports to the top of the file.
2025-05-22 18:10:37,059 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
  • # Saved from __main__.confusion_matrix
    # Original Test ID: my_custom_tests.ConfusionMatrix
    # New Test ID: <test_provider_namespace>.ConfusionMatrix
  • def ConfusionMatrix(dataset, model, normalize=False):

Register a local test provider

Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:

  • ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
  • You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
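
For illustration, such a class only needs to implement load_test. The sketch below is a hypothetical example backed by an in-memory dictionary of test functions; it is not the built-in LocalTestProvider we use next:

class DictTestProvider:
    """A toy test provider that looks tests up in a plain dictionary."""

    def __init__(self, tests):
        self._tests = tests  # e.g. {"ConfusionMatrix": confusion_matrix}

    def load_test(self, test_id):
        # Return the test function matching the given test ID
        return self._tests[test_id]
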
Want to learn more about test providers?

An extended introduction to test providers can be found in: Integrate external test providers

Initialize a local test provider

For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.

The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.

Let's go ahead and load the custom tests from our my_tests directory:

from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file

Run test provider tests

Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:

  • For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
  • For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
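
For example, a provider folder organized with a subfolder might map to test IDs like this (illustrative layout only):

my_tests/
    ConfusionMatrix.py                ->  my_test_provider.ConfusionMatrix
    classification/
        ConfusionMatrix.py            ->  my_test_provider.classification.ConfusionMatrix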

Let's go ahead and re-run the confusion matrix test with our testing dataset by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.

result = vm.tests.run_test(
    "my_test_provider.ConfusionMatrix",
    inputs={"model": vm_model, "dataset": vm_test_ds},
    params={"normalize": True},
)

result.log()

Confusion Matrix

Confusion Matrix is designed to describe the performance of a classification model by comparing predicted and true values.

The test operates by using a 2x2 table to display True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). These values help calculate metrics like accuracy, precision, recall, and F1 score. Accuracy measures the overall correctness, precision indicates the proportion of true positive results in all positive predictions, recall shows the ability to find all relevant cases, and F1 score balances precision and recall. These metrics provide a comprehensive view of model performance, with values typically ranging from 0 to 1, where higher values indicate better performance.

The primary advantages of this test include its ability to provide a detailed breakdown of model performance across different prediction outcomes. It allows for the calculation of multiple performance metrics from a single table, offering insights into both the strengths and weaknesses of the model. This makes it particularly useful for understanding how well a model distinguishes between classes and for identifying areas where the model may need improvement.

It should be noted that the confusion matrix can be limited by its focus on binary classification, which may not fully capture the performance of models dealing with multiple classes. Additionally, it does not account for the costs of different types of errors, which can be significant in certain applications. Interpretation can also be challenging if the dataset is imbalanced, as accuracy may be misleading.

This test shows a confusion matrix plot with predicted labels on the x-axis and true labels on the y-axis. The matrix is color-coded, with values ranging from 0.18 to 0.32, indicating the proportion of each category. The top left cell represents True Negatives (0.32), the top right False Positives (0.18), the bottom left False Negatives (0.19), and the bottom right True Positives (0.31). The color gradient helps visualize the distribution of predictions, with yellow indicating higher values and purple lower values. This visualization aids in quickly assessing the model's performance across different prediction outcomes.

The test results reveal the following key insights:

  • Balanced True Positives and Negatives: The model shows a relatively balanced number of true positives (0.31) and true negatives (0.32), indicating a fair ability to correctly classify both classes.
  • Moderate False Positives and Negatives: The false positive rate (0.18) and false negative rate (0.19) are moderate, suggesting some room for improvement in reducing incorrect predictions.

Based on these results, the model demonstrates a balanced performance in correctly identifying both positive and negative cases, with a moderate level of incorrect predictions. The insights suggest that while the model is effective in distinguishing between classes, there is potential to enhance its precision and recall by addressing the false positive and false negative rates. This balanced performance indicates a well-rounded model, but further refinement could improve its accuracy and reliability in specific applications.

Test Parameters

{
  "normalize": true
}

Figures

2025-05-22 18:10:53,752 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix does not exist in model's document
Again, note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for this particular test ID.

That's expected, as when we run individual tests the results logged need to be manually added to your documentation within the ValidMind Platform.

Add test results to documentation

With our custom tests run and results logged to the ValidMind Platform, let's head to the model we connected to at the beginning of this notebook and insert our test results into the documentation (Need more help?):

  1. From the Inventory in the ValidMind Platform, go to the model you connected to earlier.

  2. In the left sidebar that appears for your model, click Documentation.

  3. Locate the Data Preparation section and click on 3.2. Model Evaluation to expand that section.

  4. Hover under the Pearson Correlation Matrix content block until a horizontal dashed line with a + button appears, indicating that you can insert a new block.

    Screenshot showing insert block button in model documentation

  5. Click + and then select Test-Driven Block under FROM LIBRARY:

    • Click on Custom under TEST-DRIVEN in the left sidebar.
    • Select the two custom ConfusionMatrix tests you logged above:

    Screenshot showing the ConfusionMatrix tests selected

  6. Finally, click Insert 2 Test Results to Document to add the test results to the documentation.

    Confirm that the two individual results for the confusion matrix tests have been correctly inserted into section 3.2. Model Evaluation of the documentation.

In summary

In this third notebook, you learned how to:

Next steps

Finalize testing and documentation

Now that you're proficient at using the ValidMind Library to run and log tests, let's put the last pieces in place to prepare our fully documented sample model for review: 4 — Finalize testing and documentation