ValidMind for validation 4 — Finalize testing and reporting

Learn how to use ValidMind for your end-to-end validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.

This notebook walks you through supplementing ValidMind tests with your own custom tests and including them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs.
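As a rough illustration of that definition, the sketch below is a plain Python function that takes one input (a list of rows), one parameter (`max_share`), and returns a table as its output. The function name, its arguments, and the sample rows are all hypothetical and exist only for illustration.

```python
def missing_value_share(rows, columns, max_share=0.1):
    """Hypothetical custom test: share of missing values per column."""
    table = []
    for column in columns:
        missing = sum(1 for row in rows if row.get(column) is None)
        share = missing / len(rows)
        table.append(
            {
                "Column": column,
                "Missing share": round(share, 2),
                "Pass/Fail": "Pass" if share <= max_share else "Fail",
            }
        )
    return table


# Made-up sample rows, for illustration only
sample_rows = [
    {"CreditScore": 706, "Balance": 95386.82},
    {"CreditScore": 707, "Balance": None},
]
result_table = missing_value_share(sample_rows, ["CreditScore", "Balance"])
```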

For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform: Validator Fundamentals

Prerequisites

In order to finalize validation and reporting, you'll need to first have:

Need help with the above steps?

Refer to the first three notebooks in this series:

Setting up

This section should be very familiar to you now — as we performed the same actions in the previous two notebooks in this series.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

  1. On the left sidebar that appears for your model, select Getting Started and select Validation from the DOCUMENT drop-down menu.

  2. Click Copy snippet to clipboard.

  3. Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="validation-report",
)
Note: you may need to restart the kernel to use updated packages.
2026-05-26 22:15:48,709 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load the same sample Bank Customer Churn Prediction dataset that was used to develop the champion, and preprocess it independently:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
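Because the first shuffle and the downsampling step are not seeded, the balanced dataset can differ between runs. A minimal sketch of the same downsampling idea on a made-up frame, with a fixed `random_state` throughout (an assumption, not part of the original notebook) and a quick balance check:

```python
import pandas as pd

# Made-up stand-in for the raw dataset: 6 retained and 3 exited customers
toy_df = pd.DataFrame(
    {
        "CreditScore": [600, 610, 620, 630, 640, 650, 660, 670, 680],
        "Exited": [0, 0, 0, 0, 0, 0, 1, 1, 1],
    }
)

minority_df = toy_df.loc[toy_df["Exited"] == 1]
# Downsample the majority class to the size of the minority class
majority_df = toy_df.loc[toy_df["Exited"] == 0].sample(
    n=minority_df.shape[0], random_state=42
)

toy_balanced_df = pd.concat([minority_df, majority_df]).sample(
    frac=1, random_state=42
)

# Both classes should now have the same number of rows
class_counts = toy_balanced_df["Exited"].value_counts()
print(class_counts)
```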

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships between features to identify highly correlated variable pairs that may indicate redundancy or multicollinearity. The result table lists the top reported feature pairs with their Pearson correlation coefficients and Pass/Fail status against the configured absolute correlation threshold of 0.3. Reported coefficients range from -0.1763 to 0.3406, with one pair exceeding the threshold and the remaining pairs falling below it.

Key insights:

  • One pair exceeds threshold: The pair (Age, Exited) has a Pearson correlation coefficient of 0.3406, which is above the configured threshold of 0.3 and is the only entry marked Fail in the reported results.
  • All other reported correlations are low: The remaining nine reported pairs are marked Pass, with absolute correlation values no greater than 0.1763, including (Balance, NumOfProducts) at -0.1763 and (IsActiveMember, Exited) at -0.1752.
  • Reported relationships are mostly weak: Aside from (Age, Exited), the listed coefficients are clustered close to zero, including (CreditScore, Exited) at -0.05, (Age, Balance) at 0.0453, and (Balance, HasCrCard) at -0.036.

The reported correlation structure is concentrated in one threshold breach and otherwise consists of weak pairwise linear relationships. Among the displayed results, (Age, Exited) is the only pair identified as highly correlated under the configured test criterion, while the other listed feature pairs remain below the threshold. This indicates that the observed high-correlation signal in the reported output is isolated rather than widespread across the shown variables.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Age, Exited) 0.3406 Fail
(Balance, NumOfProducts) -0.1763 Pass
(IsActiveMember, Exited) -0.1752 Pass
(Balance, Exited) 0.1634 Pass
(NumOfProducts, Exited) -0.0706 Pass
(NumOfProducts, IsActiveMember) 0.0530 Pass
(CreditScore, Exited) -0.0500 Pass
(Age, Balance) 0.0453 Pass
(Age, NumOfProducts) -0.0439 Pass
(Balance, HasCrCard) -0.0360 Pass
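The check behind this result can be approximated outside the library. The following is a rough sketch under our own assumptions, not ValidMind's implementation: `high_correlation_pairs` is a hypothetical helper, and the three-column frame is made up so that exactly one pair breaches the threshold.

```python
from itertools import combinations

import pandas as pd


def high_correlation_pairs(df, max_threshold=0.3):
    """Flag column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr()  # pandas computes Pearson correlation by default
    rows = []
    for left, right in combinations(df.columns, 2):
        coefficient = float(corr.loc[left, right])
        rows.append(
            {
                "Columns": f"({left}, {right})",
                "Coefficient": round(coefficient, 4),
                "Pass/Fail": "Fail" if abs(coefficient) > max_threshold else "Pass",
            }
        )
    table = pd.DataFrame(rows)
    # Strongest relationships first, regardless of sign
    order = table["Coefficient"].abs().sort_values(ascending=False).index
    return table.loc[order].reset_index(drop=True)


corr_toy_df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [2, 4, 6, 8, 10],  # perfectly correlated with x
        "z": [1, -1, 0, -1, 1],  # uncorrelated with x
    }
)
toy_pairs = high_correlation_pairs(corr_toy_df)
print(toy_pairs)
```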
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
Columns Coefficient Pass/Fail
0 (Age, Exited) 0.3406 Fail
1 (Balance, NumOfProducts) -0.1763 Pass
2 (IsActiveMember, Exited) -0.1752 Pass
3 (Balance, Exited) 0.1634 Pass
4 (NumOfProducts, Exited) -0.0706 Pass
5 (NumOfProducts, IsActiveMember) 0.0530 Pass
6 (CreditScore, Exited) -0.0500 Pass
7 (Age, Balance) 0.0453 Pass
8 (Age, NumOfProducts) -0.0439 Pass
9 (Balance, HasCrCard) -0.0360 Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
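The string handling above keeps only the first name in each pair, which works here because the target `Exited` happens to be listed second. A slightly more defensive sketch, using a hypothetical helper `features_to_drop`, parses both names and never returns the target column:

```python
def features_to_drop(pair_labels, target_column="Exited"):
    """Parse labels such as '(Age, Exited)' and pick one non-target feature per pair."""
    to_drop = []
    for label in pair_labels:
        first, second = [name.strip() for name in label.strip("()").split(",")]
        # Prefer the first feature, but never drop the target column
        candidate = second if first == target_column else first
        if candidate != target_column and candidate not in to_drop:
            to_drop.append(candidate)
    return to_drop


print(features_to_drop(["(Age, Exited)"]))      # ['Age']
print(features_to_drop(["(Exited, Balance)"]))  # ['Balance']
```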
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify potentially redundant variables or multicollinearity. The result table lists the top feature pairs by absolute Pearson correlation coefficient, along with each pair’s coefficient and Pass/Fail status relative to the configured threshold of 0.3. In this run, the reported coefficients range from -0.1763 to 0.1634 across the ten strongest observed pairs. All listed feature pairs are marked as Pass.

Key insights:

  • No pair exceeds threshold: All reported absolute correlation coefficients are below the 0.3 threshold used in this test, and every listed pair is classified as Pass.
  • Strongest relationship is limited: The largest absolute coefficient is -0.1763 for (Balance, NumOfProducts), indicating that the strongest observed linear relationship in the reported output remains low in magnitude.
  • Exited correlations are modest: Feature pairs involving Exited show coefficients of -0.1752 with IsActiveMember, 0.1634 with Balance, -0.0706 with NumOfProducts, and -0.05 with CreditScore, all of which are below the test threshold.
  • Reported correlations are concentrated near zero: Several of the listed top pairs are close to zero, including (Balance, HasCrCard) at -0.036, (HasCrCard, IsActiveMember) at -0.0309, (CreditScore, EstimatedSalary) at -0.0263, and (CreditScore, Balance) at 0.0223.

The reported correlation structure shows low pairwise linear association among the listed feature combinations. The strongest observed correlations remain materially below the configured threshold, and no feature pair in the reported top results is flagged by the test. Taken together, the output indicates that the test did not identify high Pearson correlations among the reported feature pairs.

Parameters:

{
  "max_threshold": 0.3
}
            

Tables

Columns Coefficient Pass/Fail
(Balance, NumOfProducts) -0.1763 Pass
(IsActiveMember, Exited) -0.1752 Pass
(Balance, Exited) 0.1634 Pass
(NumOfProducts, Exited) -0.0706 Pass
(NumOfProducts, IsActiveMember) 0.0530 Pass
(CreditScore, Exited) -0.0500 Pass
(Balance, HasCrCard) -0.0360 Pass
(HasCrCard, IsActiveMember) -0.0309 Pass
(CreditScore, EstimatedSalary) -0.0263 Pass
(CreditScore, Balance) 0.0223 Pass

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
CreditScore Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited Geography_Germany Geography_Spain Gender_Male
1904 706 8 95386.82 1 1 1 75732.25 0 False True False
1620 707 3 0.00 2 1 0 174303.29 0 False False True
3406 536 7 178011.50 2 1 0 22375.14 0 False False True
4819 544 5 105245.21 1 0 0 99922.08 1 True False True
3364 589 1 0.00 1 0 0 125939.22 1 False True False
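`drop_first=True` drops the first category of each encoded column, which then acts as the implicit baseline; that is why each encoded feature has one fewer indicator column than it has categories. A minimal sketch on a made-up three-row frame shows the effect:

```python
import pandas as pd

toy_geo_df = pd.DataFrame(
    {
        "Geography": ["France", "Germany", "Spain"],
        "Balance": [0.0, 10.5, 20.0],
    }
)

# The first category ("France") gets no column of its own
encoded_df = pd.get_dummies(toy_geo_df, columns=["Geography"], drop_first=True)
print(list(encoded_df.columns))
```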
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)
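The split above is random on every run because no `random_state` is passed. If reproducible evidence matters for your review, one option (a suggestion, not something the original notebook does) is to seed the split and stratify on the target, sketched here on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_split_df = pd.DataFrame(
    {
        "Balance": [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0],
        "Exited": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    }
)

toy_train_df, toy_test_df = train_test_split(
    toy_split_df,
    test_size=0.20,
    random_state=42,  # fixed seed so the split is repeatable
    stratify=toy_split_df["Exited"],  # keep the class ratio in both splits
)
```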

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion submitted by the development team in the format of a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
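The `InconsistentVersionWarning` above is worth recording, since it signals that the champion was pickled with a different scikit-learn version than the one now loading it. A minimal sketch for capturing warnings raised while loading an artifact, using only the standard library; `load_with_warnings` and `fake_loader` are hypothetical names, and the stand-in loader merely simulates a warning:

```python
import warnings


def load_with_warnings(loader):
    """Call `loader()` and return its result together with any warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        loaded = loader()
    return loaded, [str(w.message) for w in caught]


def fake_loader():
    # Stand-in for `pkl.load(f)`; simulates a version mismatch warning
    warnings.warn("estimator was pickled with a different library version", UserWarning)
    return {"model": "placeholder"}


loaded_obj, messages = load_with_warnings(fake_loader)
print(messages)
```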

Train potential challenger model

We'll also train our random forest classification challenger to see how it compares:

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=50, random_state=42)

Initialize the ValidMind models

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) for each of our two models, which can then be passed to other functions for analysis and tests on the data:

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-05-26 22:16:04,576 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-05-26 22:16:04,578 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-05-26 22:16:04,579 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-05-26 22:16:04,582 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-05-26 22:16:04,584 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-05-26 22:16:04,586 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-05-26 22:16:04,587 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-05-26 22:16:04,589 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-05-26 22:16:04,592 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-05-26 22:16:04,616 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-05-26 22:16:04,618 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-05-26 22:16:04,644 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-05-26 22:16:04,647 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-05-26 22:16:04,660 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-05-26 22:16:04,661 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-05-26 22:16:04,674 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Implementing custom tests

Thanks to the documentation (Learn more: ValidMind for development), we know that the development team implemented a custom test to further evaluate the performance of the champion.

In a usual validation situation, you would load a saved custom test provided by the development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the process.

Want to learn more about custom tests?

Refer to our in-depth introduction to custom tests: Implement custom tests

Implement a custom inline test

Let's implement the same custom inline test that calculates the confusion matrix for a binary classification model that the development team used in their performance evaluations.

  • An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
  • You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.

Create a confusion matrix plot

Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:

import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()
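For a binary problem, scikit-learn lays out the matrix with true labels as rows and predicted labels as columns, so the four cells can be unpacked with `ravel()`. A small sketch on made-up labels:

```python
from sklearn import metrics

toy_y_true = [0, 0, 1, 1, 1]
toy_y_pred = [0, 1, 1, 1, 0]

# Row order is the true label, column order is the predicted label
tn, fp, fn, tp = metrics.confusion_matrix(toy_y_true, toy_y_pred).ravel()
print(tn, fp, fn, tp)  # 1 1 1 2
```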

Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:

  • The function confusion_matrix takes two arguments, dataset and model. These are VMDataset and VMModel objects, respectively.
    • VMDataset objects allow you to access the dataset's true (target) values by accessing the .y attribute.
    • VMDataset objects allow you to access the predictions for a given record (model) by accessing the .y_pred() method.
  • The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
  • The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
  • The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
  • The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself

You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:

# Champion train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates classification performance by comparing predicted labels with observed labels across the train and test datasets. The results are shown as 2x2 matrices with counts for true negatives, false positives, false negatives, and true positives. For the training dataset, the matrix contains 848 true negatives, 459 false positives, 476 false negatives, and 802 true positives. For the test dataset, the matrix contains 211 true negatives, 98 false positives, 131 false negatives, and 207 true positives.

Key insights:

  • Correct classifications exceed errors: In both datasets, the diagonal cells are larger than the off-diagonal cells. Training results show 848 true negatives and 802 true positives versus 459 false positives and 476 false negatives, while test results show 211 true negatives and 207 true positives versus 98 false positives and 131 false negatives.

  • Negative-class identification is slightly stronger: True negatives exceed true positives in both datasets, with 848 versus 802 in training and 211 versus 207 in testing. This indicates slightly higher correct identification of the negative class than the positive class in the observed results.

  • False negatives exceed false positives on test data: On the test dataset, false negatives total 131 compared with 98 false positives. In contrast, the training dataset shows a narrower difference, with 476 false negatives and 459 false positives.

The confusion matrix results show that the model produces more correct than incorrect classifications on both the training and test datasets. Performance appears directionally similar across datasets, with true negatives and true positives representing the largest cell counts in each matrix. The test sample shows a more noticeable imbalance between the two error types, with false negatives exceeding false positives by a wider margin than in training.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:champion:e390
ValidMind Figure my_custom_tests.ConfusionMatrix:champion:51b7
2026-05-26 22:16:15,138 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
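The two figures above correspond to the two datasets listed in `input_grid`. Assuming `input_grid` expands into one run per combination of the listed inputs (our reading of the output, not a statement about the library's internals), the expansion can be sketched with `itertools.product`; the string IDs below stand in for the dataset and model objects:

```python
from itertools import product

input_grid = {
    "dataset": ["train_dataset_final", "test_dataset_final"],
    "model": ["log_model_champion"],
}

# One dictionary of inputs per combination of grid values
keys = list(input_grid)
grid_combinations = [
    dict(zip(keys, values)) for values in product(*input_grid.values())
]

for combo in grid_combinations:
    print(combo)
```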
# Challenger train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix test evaluates classification performance by comparing predicted labels with true labels across the train and test datasets. The result is presented as two 2x2 matrices showing counts for true negatives, false positives, false negatives, and true positives. For the training dataset, the matrix entries are 1,307 true negatives, 0 false positives, 1 false negative, and 1,277 true positives. For the test dataset, the matrix entries are 221 true negatives, 88 false positives, 98 false negatives, and 240 true positives.

Key insights:

  • Near-perfect training classification: The training confusion matrix shows 1,307 true negatives and 1,277 true positives, with only 1 false negative and 0 false positives. Misclassification on the training dataset is therefore minimal.

  • Test errors occur in both directions: On the test dataset, both false positives and false negatives are present at meaningful levels, with 88 false positives and 98 false negatives. This indicates that misclassification is not concentrated in only one prediction direction.

  • Correct test predictions remain the largest cells: The largest counts in the test confusion matrix are the correctly classified observations, with 221 true negatives and 240 true positives. Despite the presence of errors, correct classifications exceed either error category individually.

  • Performance differs materially between train and test: The training matrix shows almost no classification error, while the test matrix shows substantially higher error counts in both off-diagonal cells. The difference is visible directly from the contrast between the nearly diagonal-only training matrix and the more dispersed test matrix.

Taken together, the confusion matrices show that the challenger model classifies the training data almost perfectly, with only a single error and no false positives. On the test data, classification remains concentrated in the correct prediction cells, but both false positive and false negative counts increase materially relative to training. The overall result reflects a clear difference between in-sample and out-of-sample classification behavior.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:9fbb
ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:080d
2026-05-26 22:16:25,557 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note that the output indicates that a test-driven block doesn't currently exist in your documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.
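The test-dataset counts reported above can be condensed into a few summary metrics for a side-by-side comparison. The sketch below hardcodes the counts from this particular run (they will differ on a rerun, since the train/test split is not seeded); `summarize` is a hypothetical helper:

```python
def summarize(tn, fp, fn, tp):
    """Derive accuracy, precision, and recall from confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": round((tp + tn) / total, 3),
        "precision": round(tp / (tp + fp), 3),
        "recall": round(tp / (tp + fn), 3),
    }


# Test-dataset counts taken from the outputs above
champion = summarize(tn=211, fp=98, fn=131, tp=207)
challenger = summarize(tn=221, fp=88, fn=98, tp=240)

print("champion:", champion)
print("challenger:", challenger)
```

On these counts the challenger scores higher on the test data, although its near-perfect training matrix shows a much larger gap between training and test behavior than the champion's.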

Add parameters to custom tests

Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself
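`normalize="all"` divides every cell by the total number of observations, so the four cells sum to 1. scikit-learn also accepts `"true"` (normalize each row) and `"pred"` (normalize each column). A short sketch on made-up labels contrasts the first two:

```python
from sklearn import metrics

toy_y_true = [0, 0, 1, 1]
toy_y_pred = [0, 1, 1, 1]

cm_all = metrics.confusion_matrix(toy_y_true, toy_y_pred, normalize="all")
cm_true = metrics.confusion_matrix(toy_y_true, toy_y_pred, normalize="true")

print(cm_all)   # shares of all observations
print(cm_true)  # shares within each true class
```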

Pass parameters to custom tests

You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.

  • The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
  • Since these are VMDataset or VMModel inputs, they have a special meaning.
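Conceptually, inputs and parameters both end up as keyword arguments of the test function, with parameters replacing the defaults in its signature. A simplified sketch of that idea, where `run` and `toy_test` are hypothetical and are not the ValidMind implementation:

```python
def toy_test(dataset, model, normalize=False):
    """Stand-in for a custom test; reports which settings it received."""
    return {"dataset": dataset, "model": model, "normalize": normalize}


def run(test_func, inputs, params=None):
    """Call the test with its inputs plus any parameter overrides."""
    return test_func(**inputs, **(params or {}))


toy_inputs = {"dataset": "test_dataset_final", "model": "rf_model"}

default_result = run(toy_test, toy_inputs)
override_result = run(toy_test, toy_inputs, params={"normalize": True})
```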

Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:

# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_log_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Champion

The Confusion Matrix test evaluates classification performance by comparing predicted labels with true labels, and this result presents the normalized confusion matrix for the champion logistic model on the test dataset. The matrix displays the proportion of observations in each true/predicted label combination, with values shown directly in the four cells. The observed normalized cell values are 0.33 for true negatives, 0.15 for false positives, 0.20 for false negatives, and 0.32 for true positives.

Key insights:

  • Correct classifications dominate overall: The diagonal cells sum to 0.65, with 0.33 in the true negative cell and 0.32 in the true positive cell, indicating that most observations are classified correctly.
  • Error types are unevenly distributed: The off-diagonal cells total 0.35, with false negatives at 0.20 and false positives at 0.15, showing a modestly higher share of missed positive cases than incorrectly assigned positive cases.
  • Diagonal balance is relatively even: The two correct-classification cells are close in magnitude, differing by 0.01 between true negatives (0.33) and true positives (0.32), indicating similar normalized representation of correct predictions across both classes.

The normalized confusion matrix shows that correct predictions account for the majority of outcomes, with a nearly even split between correctly identified negative and positive cases. Misclassifications are lower in aggregate, though false negatives occur somewhat more frequently than false positives. Overall, the result reflects a balanced pattern of correct classification with a moderate asymmetry in error distribution.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_champion:b109
2026-05-26 22:16:33,644 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model" : [vm_rf_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Challenger

The ConfusionMatrix test evaluates classification performance by comparing predicted and true class labels in a normalized 2x2 matrix. The displayed result is a normalized confusion matrix for dataset=test_dataset_final and model=rf_model, with rows representing true labels and columns representing predicted labels. The matrix entries are 0.34 for true negatives, 0.14 for false positives, 0.15 for false negatives, and 0.37 for true positives. Because normalization is enabled, the values are presented as proportions rather than raw counts.

Key insights:

  • Correct classifications dominate the matrix: The diagonal cells sum to 0.71, comprising 0.34 true negatives and 0.37 true positives, while the off-diagonal error cells sum to 0.29.
  • True positives are the largest component: The highest single cell is the true positive rate at 0.37, slightly above the true negative cell at 0.34.
  • False negatives slightly exceed false positives: The false negative cell is 0.15 compared with 0.14 for false positives, indicating very similar error contributions with a marginally higher share of missed positive cases.
  • Predicted positives exceed predicted negatives: Summing by prediction column gives 0.51 for predicted True and 0.49 for predicted False, showing a near-balanced prediction split with a slight tilt toward positive predictions.

The normalized confusion matrix shows that most observations fall on the diagonal, with 71% classified correctly and 29% allocated to error cells. Correct positive classifications represent the largest individual share, followed closely by correct negative classifications. Misclassification is distributed fairly evenly across false positives and false negatives, with only a small difference between the two error types.

Parameters:

{
  "normalize": true
}
            

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_challenger:d45d
2026-05-26 22:16:42,121 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document

Use external test providers

Sometimes you may want to reuse the same set of custom tests across multiple records (models) and share them with others in your organization, as the development team would have done with you in the example workflow featured in this series of notebooks. In this case, you can create an external custom test provider that lets you load custom tests from a local folder or a Git repository.

In this section you will learn how to declare a local filesystem test provider that allows loading tests from a local folder following these high level steps:

  1. Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
  2. Save an inline test to a file
  3. Define and register a LocalTestProvider that points to that folder
  4. Run test provider tests
  5. Add the test results to your documentation

Create custom tests folder

Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.

The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:

import os
import shutil

tests_folder = "my_tests"

# Create the tests folder if it doesn't already exist
os.makedirs(tests_folder, exist_ok=True)

# Remove any previously saved tests and cached bytecode
for f in os.listdir(tests_folder):
    path = os.path.join(tests_folder, f)
    if f.endswith(".py"):
        os.remove(path)
    elif f == "__pycache__":
        shutil.rmtree(path)

After running the command above, confirm that a new my_tests directory was created successfully. For example:

~/notebooks/tutorials/validation/my_tests/

Save an inline test

The @vm.test decorator we used in Implement a custom inline test above to register one-off custom tests also adds a convenience method to the function object: call <func_name>.save() to save the test to a Python file at a specified path.

While save() gets you started by creating the file and saving the function code with the correct name, it won't automatically include anything defined outside the test function itself, such as imports, helper functions, or variables that the test needs to run. To solve this, pass the optional imports argument to ensure the necessary imports are added to the file.

The confusion_matrix test requires the following additional imports:

import matplotlib.pyplot as plt
from sklearn import metrics

Let's pass these imports to the save() method to ensure they are included in the file with the following command:

confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-05-26 22:16:42,597 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/validation/my_tests/ConfusionMatrix.py!Be sure to add any necessary imports to the top of the file.
2026-05-26 22:16:42,598 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix

# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix

def ConfusionMatrix(dataset, model, normalize=False):

Register a local test provider

Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:

  • ValidMind offers out-of-the-box test providers for local tests (tests in a folder) and for tests in a GitHub repository.
  • You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
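
As a rough illustration of the second point, here is a minimal sketch of a custom provider. The class name DictTestProvider and the toy row_count function are hypothetical, and the sketch assumes the library only needs a load_test method that receives a test ID and returns a callable; check the ValidMind Library documentation for the exact interface before relying on it.

```python
from typing import Callable, Dict


class DictTestProvider:
    """Hypothetical provider that serves tests from an in-memory dictionary."""

    def __init__(self, tests: Dict[str, Callable]):
        # Copy the mapping so later changes to the caller's dict have no effect
        self._tests = dict(tests)

    def load_test(self, test_id: str) -> Callable:
        """Return the test function registered under `test_id`."""
        try:
            return self._tests[test_id]
        except KeyError:
            raise FileNotFoundError(f"No test found for ID: {test_id}") from None


def row_count(rows):
    """Toy test function: returns the number of rows it receives."""
    return len(rows)


provider = DictTestProvider({"RowCount": row_count})
loaded = provider.load_test("RowCount")
print(loaded([1, 2, 3]))  # prints 3
```

A provider like this could wrap any storage backend, as long as load_test returns the matching function or raises a clear error when the ID is unknown.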

Want to learn more about test providers?

An extended introduction to test providers can be found in: Integrate external test providers

Initialize a local test provider

For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.

The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.
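
To make the role of the namespace concrete, the sketch below shows one way a namespace prefix could route a test ID to the right provider. It is not the library's implementation: resolve_test, the provider names, and the plain dictionaries standing in for providers are all assumptions made for illustration.

```python
def resolve_test(test_id, providers):
    """Split `namespace.local_id` and look up `local_id` in the matching provider."""
    namespace, _, local_id = test_id.partition(".")
    if namespace not in providers:
        raise KeyError(f"No test provider registered for namespace: {namespace}")
    return providers[namespace][local_id]


# Two hypothetical providers that both expose a test named `ConfusionMatrix`
dev_tests = {"ConfusionMatrix": lambda: "development team version"}
val_tests = {"ConfusionMatrix": lambda: "validation team version"}
providers = {"dev_provider": dev_tests, "val_provider": val_tests}

print(resolve_test("dev_provider.ConfusionMatrix", providers)())
print(resolve_test("val_provider.ConfusionMatrix", providers)())
```

Because the namespace is part of the full test ID, the two ConfusionMatrix tests never collide even though their local names are identical.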

Let's go ahead and load the custom tests from our my_tests directory:

from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in the `my_tests/ConfusionMatrix.py` file

Run test provider tests

Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:

  • For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
  • For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
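
The naming rule above can be sketched as a small helper. The function build_test_id is hypothetical and only mirrors the convention described here (namespace, then the file path relative to the tests folder with separators replaced by dots); it is not part of the ValidMind Library.

```python
from pathlib import PurePosixPath


def build_test_id(namespace, relative_path):
    """Join the namespace with the dotted path of a test file (extension removed)."""
    parts = PurePosixPath(relative_path).with_suffix("").parts
    return ".".join((namespace, *parts))


print(build_test_id("my_test_provider", "ConfusionMatrix.py"))
# prints my_test_provider.ConfusionMatrix
print(build_test_id("my_test_provider", "classification/ConfusionMatrix.py"))
# prints my_test_provider.classification.ConfusionMatrix
```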

Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.

# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates classification outcomes by comparing predicted labels against true labels in a 2x2 table of correct and incorrect predictions. For the test_dataset_final results shown for log_model_champion, the matrix reports 211 true negatives, 98 false positives, 131 false negatives, and 207 true positives. The figure therefore summarizes model performance across both classes by showing the distribution of correct classifications on the diagonal and misclassifications on the off-diagonal.

Key insights:

  • Correct predictions exceed errors: The model records 418 correct classifications in total, comprising 211 true negatives and 207 true positives, versus 229 misclassifications from 98 false positives and 131 false negatives.
  • False negatives exceed false positives: The number of false negatives is 131, compared with 98 false positives, indicating more missed positive cases than incorrect positive assignments.
  • Balanced correct classifications across classes: True negatives and true positives are similar in magnitude at 211 and 207 respectively, showing comparable volumes of correct predictions for the negative and positive classes.
  • Positive class is more frequent: The confusion matrix totals 309 observations for the negative class (211 + 98) and 338 observations for the positive class (131 + 207), indicating a modestly higher count of positive true labels.

The confusion matrix shows that correct classifications are more numerous than errors, with similar counts of true negatives and true positives. Misclassification is present in both directions, with false negatives occurring more often than false positives. Overall, the result reflects a model that identifies both classes with comparable volumes of correct predictions while retaining a nontrivial level of classification error in the test dataset.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:champion:2095
2026-05-26 22:16:49,438 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document

# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix test evaluates classification performance by comparing predicted labels with observed labels on the test dataset for the challenger model (rf_model). The 2x2 matrix reports 221 true negatives, 240 true positives, 88 false positives, and 98 false negatives. These results show the distribution of correct and incorrect classifications across both classes, with the largest counts in the correctly classified cells and error counts present in both off-diagonal cells.

Key insights:

  • Correct classifications exceed errors: The matrix contains 221 true negatives and 240 true positives, compared with 88 false positives and 98 false negatives. Correct predictions total 461 observations versus 186 misclassifications.
  • Positive class is identified slightly more often: True positives (240) are higher than true negatives (221), indicating more correct classifications for the positive class than for the negative class in absolute count terms.
  • Both error types are material: False negatives total 98 and false positives total 88. Misclassification is therefore present on both sides of the decision boundary, with false negatives slightly exceeding false positives.
  • Prediction counts are relatively balanced: Predicted positive outcomes total 328 (240 true positives + 88 false positives), while predicted negative outcomes total 319 (221 true negatives + 98 false negatives). This indicates a near-even split between predicted classes.

The confusion matrix indicates that the challenger model produces more correct than incorrect classifications on the test dataset, with substantial counts in both true positive and true negative cells. Misclassifications occur in both directions, and the false negative count is modestly higher than the false positive count. Overall, the result reflects a relatively balanced prediction pattern across classes with stronger counts in the correctly classified cells.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:challenger:8560
2026-05-26 22:16:56,360 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document
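
As a quick cross-check of the counts reported for the two models, the sketch below recomputes the totals and overall accuracy from the four cells of each confusion matrix. The summarize helper is hypothetical; the cell counts are taken from the test results above.

```python
def summarize(tn, fp, fn, tp):
    """Derive totals and accuracy from the four cells of a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    correct = tn + tp
    return {
        "total": total,
        "correct": correct,
        "errors": fp + fn,
        "accuracy": correct / total,
    }


# Cell counts as reported for the champion and challenger on the test dataset
champion = summarize(tn=211, fp=98, fn=131, tp=207)
challenger = summarize(tn=221, fp=88, fn=98, tp=240)

for name, s in (("champion", champion), ("challenger", challenger)):
    print(f"{name}: {s['correct']}/{s['total']} correct ({s['accuracy']:.1%})")
# champion: 418/647 correct (64.6%)
# challenger: 461/647 correct (71.3%)
```

The challenger's share of correct predictions also lines up with the 0.71 diagonal total reported by the normalized confusion matrix earlier.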

Verify test runs

Our final task is to verify that all the tests provided by the development team were run and reported accurately. Note that a result_id is appended to the relevant test IDs to indicate which dataset each test was run with.

Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note here that inputs and input_grid expect the input_id of the dataset or model as the value rather than the variable name we specified:

test_config = {
    # Run with the raw dataset
    'validmind.data_validation.DatasetDescription:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.DescriptiveStatistics:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.MissingValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percentage_threshold': 1}
    },
    'validmind.data_validation.ClassImbalance:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.Duplicates:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.HighCardinality:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {
            'num_threshold': 100,
            'percent_threshold': 0.1,
            'threshold_type': 'percent'
        }
    },
    'validmind.data_validation.Skewness:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_threshold': 1}
    },
    'validmind.data_validation.UniqueRows:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TooManyZeroValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_percent_threshold': 0.03}
    },
    'validmind.data_validation.IQROutliersTable:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'threshold': 5}
    },
    # Run with the preprocessed dataset
    'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.MissingValues:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'min_percentage_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'default_column': 'loan_status'}
    },
    # Run with the training and test datasets
    'validmind.data_validation.DescriptiveStatistics:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.TabularDescriptionTables:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.ClassImbalance:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.UniqueRows:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.MutualInformation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_threshold': 0.01}
    },
    'validmind.data_validation.PearsonCorrelationMatrix:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.HighPearsonCorrelation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'max_threshold': 0.3, 'top_n_correlations': 10}
    },
    'validmind.model_validation.ModelMetadata': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ModelParameters': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ROCCurve': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']}
    },
    'validmind.model_validation.sklearn.MinimumROCAUCScore': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']},
        'params': {'min_threshold': 0.5}
    }
}
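
To see what an input_grid entry implies, the sketch below expands a grid into individual input combinations. It assumes that input_grid runs the test once for every combination of the listed input_ids; expand_input_grid is a hypothetical helper written for illustration, not a ValidMind function.

```python
from itertools import product


def expand_input_grid(input_grid):
    """Return one dict of inputs per combination of the values in `input_grid`."""
    keys = list(input_grid)
    value_lists = (input_grid[key] for key in keys)
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


# Same shape as the ROCCurve entry in test_config
grid = {
    "dataset": ["train_dataset_final", "test_dataset_final"],
    "model": ["log_model_champion"],
}

for combo in expand_input_grid(grid):
    print(combo)
```

Under that assumption, two datasets and one model yield two runs, one per dataset.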

Then batch run and log our tests in test_config:

for test_id, config in test_config.items():
    print(test_id)
    try:
        # Each entry holds `inputs` or `input_grid`, plus optional `params`,
        # so it can be passed straight through as keyword arguments
        vm.tests.run_test(test_id, **config).log()
    except Exception as e:
        print(f"Error running test {test_id}: {e}")
validmind.data_validation.DatasetDescription:raw_data

Dataset Description Raw Data

The Dataset Description test evaluates column-level completeness, data types, and cardinality across the raw dataset. The results summarize 11 variables over 8,000 records, showing each column’s inferred type together with counts, missingness, and distinct-value levels. The table includes six numeric variables and five categorical variables, with all columns reported at full row count and no missing values.

Key insights:

  • No missing values observed: All 11 columns have a count of 8,000 with 0 missing values and 0.0% missingness, indicating complete population across the raw dataset.
  • Mixed numeric and categorical structure: The dataset contains 6 numeric columns (CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary) and 5 categorical columns (Geography, Gender, HasCrCard, IsActiveMember, Exited), reflecting a mixed feature set.
  • EstimatedSalary is fully unique: EstimatedSalary has 8,000 distinct values out of 8,000 records, corresponding to a distinct proportion of 1.0, making it the highest-cardinality field in the dataset.
  • Balance also shows high cardinality: Balance contains 5,088 distinct values, representing 63.6% of the dataset, which is materially higher than the distinct-value levels of the other numeric variables.
  • Several variables are low-cardinality: Geography has 3 distinct values, Gender, HasCrCard, IsActiveMember, and Exited each have 2, NumOfProducts has 4, and Tenure has 11, indicating multiple fields with a limited set of observed values.

The dataset description indicates complete raw data coverage with no observed missingness across any column. Feature structure is split between numeric and categorical variables, with cardinality ranging from binary categorical fields to fully unique values in EstimatedSalary. The most prominent concentration of distinct values appears in EstimatedSalary and Balance, while several operational and demographic fields exhibit very limited category counts.

Tables

Dataset Description

Name Type Count Missing Missing % Distinct Distinct %
CreditScore Numeric 8000.0 0 0.0 452 0.0565
Geography Categorical 8000.0 0 0.0 3 0.0004
Gender Categorical 8000.0 0 0.0 2 0.0002
Age Numeric 8000.0 0 0.0 69 0.0086
Tenure Numeric 8000.0 0 0.0 11 0.0014
Balance Numeric 8000.0 0 0.0 5088 0.6360
NumOfProducts Numeric 8000.0 0 0.0 4 0.0005
HasCrCard Categorical 8000.0 0 0.0 2 0.0002
IsActiveMember Categorical 8000.0 0 0.0 2 0.0002
EstimatedSalary Numeric 8000.0 0 0.0 8000 1.0000
Exited Categorical 8000.0 0 0.0 2 0.0002
2026-05-26 22:17:03,612 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
validmind.data_validation.DescriptiveStatistics:raw_data

Descriptive Statistics Raw Data

The Descriptive Statistics test evaluates the distributional characteristics of the raw dataset across numerical and categorical variables. The results summarize central tendency, spread, percentile ranges, and observed category frequencies for eight numerical fields and two categorical fields, each with 8,000 recorded observations. Numerical summaries include means, standard deviations, and distribution quantiles from the minimum through the 95th percentile and maximum, while categorical summaries report unique value counts and the most frequent category with its share of records.

Key insights:

  • Balance shows pronounced dispersion: Balance has a mean of 76,434.10, a median of 97,264, and a standard deviation of 62,612.25, with the 25th percentile at 0 and the maximum at 250,898. This indicates a wide spread, with at least a quarter of records holding a zero balance, which pulls the mean below the median.

  • CreditScore and Age are broadly distributed: CreditScore ranges from 350 to 850 with a mean of 650.16 and median of 652, while Age ranges from 18 to 92 with a mean of 38.95 and median of 37. In both variables, mean and median are relatively close, indicating limited shift in central tendency relative to their overall ranges.

  • NumOfProducts is concentrated at low values: NumOfProducts has a mean of 1.53 and median of 1, with the 75th, 90th, and 95th percentiles at 2 and a maximum of 4. Most observations are concentrated in the lower part of the available range.

  • Binary indicators are moderately imbalanced: HasCrCard has a mean of 0.7026, indicating that approximately 70.26% of records take the value 1, while IsActiveMember has a mean of 0.5199, indicating a near-even split with a slight majority of 1s.

  • EstimatedSalary is spread broadly across its range: EstimatedSalary ranges from 12 to 199,992, with a mean of 99,790.19 and median of 99,505. The quartiles at 50,857 and 149,216 show a broad distribution around the center.

  • Categorical concentration is moderate: Geography contains 3 unique values, with France as the most frequent category at 4,010 records or 50.12%. Gender contains 2 unique values, with Male as the top category at 4,396 records or 54.95%, indicating moderate concentration rather than dominance by a single category.

The descriptive profile shows complete recorded observations across all reported variables and a mix of distribution shapes across the dataset. CreditScore, Age, and EstimatedSalary exhibit central tendencies that are closely aligned with their medians, while Balance displays substantially wider dispersion and a lower quartile at zero. Product ownership and binary indicator variables are concentrated in a limited set of values, and the categorical fields show moderate concentration in the leading categories without extreme single-category dominance.

Tables

Numerical Variables

Name Count Mean Std Min 25% 50% 75% 90% 95% Max
CreditScore 8000.0 650.1596 96.8462 350.0 583.0 652.0 717.0 778.0 813.0 850.0
Age 8000.0 38.9489 10.4590 18.0 32.0 37.0 44.0 53.0 60.0 92.0
Tenure 8000.0 5.0339 2.8853 0.0 3.0 5.0 8.0 9.0 9.0 10.0
Balance 8000.0 76434.0965 62612.2513 0.0 0.0 97264.0 128045.0 149545.0 162488.0 250898.0
NumOfProducts 8000.0 1.5325 0.5805 1.0 1.0 1.0 2.0 2.0 2.0 4.0
HasCrCard 8000.0 0.7026 0.4571 0.0 0.0 1.0 1.0 1.0 1.0 1.0
IsActiveMember 8000.0 0.5199 0.4996 0.0 0.0 1.0 1.0 1.0 1.0 1.0
EstimatedSalary 8000.0 99790.1880 57520.5089 12.0 50857.0 99505.0 149216.0 179486.0 189997.0 199992.0

Categorical Variables

Name Count Number of Unique Values Top Value Top Value Frequency Top Value Frequency %
Geography 8000.0 3.0 France 4010.0 50.12
Gender 8000.0 2.0 Male 4396.0 54.95
2026-05-26 22:17:13,352 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
validmind.data_validation.MissingValues:raw_data

✅ Missing Values Raw Data

The Missing Values test evaluates dataset completeness by measuring the percentage of missing values in each feature against the configured 1% threshold. The results are presented for 11 raw data columns, with each row showing the count of missing values, the corresponding missing-value percentage, and the pass/fail outcome. All listed features, including both predictors and the target variable, show 0 missing values and a missing-value percentage of 0.0%, resulting in a pass status for every column.

Key insights:

  • No missing values detected: All 11 columns report 0 missing values and 0.0% missingness in the raw dataset.
  • All features passed threshold: Every column received a Pass result under the configured 1% missing-value threshold.
  • Completeness is uniform across fields: Missingness results are identical across all reported variables, including CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited.

The test results show complete observed data coverage across all evaluated raw data fields, with no column exhibiting any missing entries. Relative to the configured 1% threshold, missingness remains at 0.0% for every feature, and the dataset passes this completeness check without exceptions.

Parameters:

{
  "min_percentage_threshold": 1
}
            

Tables

Column Number of Missing Values Percentage of Missing Values (%) Pass/Fail
CreditScore 0 0.0 Pass
Geography 0 0.0 Pass
Gender 0 0.0 Pass
Age 0 0.0 Pass
Tenure 0 0.0 Pass
Balance 0 0.0 Pass
NumOfProducts 0 0.0 Pass
HasCrCard 0 0.0 Pass
IsActiveMember 0 0.0 Pass
EstimatedSalary 0 0.0 Pass
Exited 0 0.0 Pass
2026-05-26 22:17:17,786 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
validmind.data_validation.ClassImbalance:raw_data

✅ Class Imbalance Raw Data

The Class Imbalance test evaluates the distribution of target classes in the raw dataset by measuring each class’s share of total records against the configured minimum threshold of 10%. The results for the Exited target show two classes, with class 0 representing 79.80% of rows and class 1 representing 20.20% of rows. The accompanying table and bar chart present these class proportions and the corresponding pass/fail status for each class.

Key insights:

  • Both classes pass threshold: Class 0 at 79.80% and class 1 at 20.20% both exceed the configured 10% minimum threshold, and both are marked as Pass.
  • Majority class is class 0: The target distribution is concentrated in class 0, which accounts for 79.80% of observations, compared with 20.20% for class 1.
  • Observed class split is uneven: The class shares differ by 59.60 percentage points, indicating that the dataset contains substantially more observations for class 0 than for class 1.

The results show that both target classes satisfy the test’s minimum representation criterion under the 10% threshold. At the same time, the raw target distribution is weighted toward class 0, with roughly four times as many observations as class 1. This indicates that the dataset passes the configured class presence check while exhibiting a clear majority/minority class structure.

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

Exited Class Imbalance

Exited Percentage of Rows (%) Pass/Fail
0 79.80% Pass
1 20.20% Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:raw_data:0a66
2026-05-26 22:17:27,594 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
validmind.data_validation.Duplicates:raw_data

✅ Duplicates Raw Data

The Duplicates test evaluates whether the raw dataset contains exact duplicate rows by counting duplicate records and expressing them as a share of total rows. The result table reports the number of duplicates and the percentage of rows identified as duplicates for the dataset. In this run, the table shows 0 duplicate rows and a duplicate-row percentage of 0.0%.

Key insights:

  • No duplicate rows detected: The dataset contains 0 duplicate rows, indicating that no exact row-level repetitions were identified by this test.
  • Duplicate share is zero: The percentage of rows flagged as duplicates is 0.0%, showing that duplicates were absent in proportional as well as absolute terms.
  • Result is below threshold: With min_threshold set to 1, the observed duplicate count of 0 falls below the configured threshold used by the test.

The duplicate-row assessment shows no exact duplicates in the raw dataset under the test configuration used. Both the absolute duplicate count and the duplicate percentage are zero, and the observed result is below the configured threshold. Collectively, the result indicates that this specific row-level data quality check did not identify redundant records in the dataset.

Parameters:

{
  "min_threshold": 1
}
            

Tables

Duplicate Rows Results for Dataset

Number of Duplicates Percentage of Rows (%)
0 0.0
2026-05-26 22:17:34,177 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
validmind.data_validation.HighCardinality:raw_data

✅ High Cardinality Raw Data

The High Cardinality test evaluates the number of unique values in categorical columns to identify columns with unusually large distinct-value counts. The result table reports the categorical columns assessed, their number of distinct values, the percentage of distinct values, and the resulting pass/fail outcome against the configured threshold. In this run, two categorical columns were evaluated: Geography with 3 distinct values and 0.0375% distinctness, and Gender with 2 distinct values and 0.025% distinctness. Both columns are marked as passing.

Key insights:

  • No categorical failures observed: Both evaluated categorical columns, Geography and Gender, passed the test based on their reported distinct-value counts.
  • Low distinct-value counts across categories: Geography contains 3 distinct values and Gender contains 2, indicating limited category breadth in the assessed categorical fields.
  • Distinctness percentages remain small: Reported percentages of distinct values are 0.0375% for Geography and 0.025% for Gender, both well below the configured 0.1 threshold.

The test results show that the evaluated categorical columns exhibit low cardinality under the configured threshold settings. Both Geography and Gender remain below the reported percentage threshold and pass the assessment, with no categorical column flagged for elevated distinct-value counts in this test run.

Parameters:

{
  "num_threshold": 100,
  "percent_threshold": 0.1,
  "threshold_type": "percent"
}
            

Tables

Column Number of Distinct Values Percentage of Distinct Values (%) Pass/Fail
Geography 3 0.0375 Pass
Gender 2 0.0250 Pass
2026-05-26 22:17:38,845 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data

❌ Skewness Raw Data

The Skewness test evaluates asymmetry in the distribution of numeric variables by comparing each column’s skewness against the configured maximum threshold of 1. The results table reports skewness values and pass/fail outcomes for nine numeric columns in the raw dataset. Seven columns pass the threshold, while two columns exceed it: Age with skewness of 1.0245 and Exited with skewness of 1.4847. The remaining variables show skewness values ranging from -0.8867 to 0.7172.

Key insights:

  • Two variables exceed threshold: Age and Exited are the only columns that fail the test, with skewness values of 1.0245 and 1.4847 respectively, both above the maximum threshold of 1.
  • Exited shows highest skewness: Exited has the largest absolute skewness in the dataset at 1.4847, making it the most asymmetric numeric variable in the test output.
  • Most variables are low-skew: Seven of nine numeric columns pass the test, including CreditScore, Tenure, Balance, IsActiveMember, and EstimatedSalary, all of which have skewness values close to zero.
  • Negative skewness is present but within limit: HasCrCard (-0.8867), Balance (-0.1353), CreditScore (-0.062), and IsActiveMember (-0.0796) all show negative skewness, but each remains within the allowed threshold.
  • Moderate positive skewness appears in product count: NumOfProducts records skewness of 0.7172, indicating positive asymmetry that remains below the failure cutoff.

The test results show that skewness is limited for most numeric fields in the dataset, with seven of nine variables remaining within the configured threshold. The most material exceptions are Exited and Age, which exceed the threshold and represent the only failed columns in the assessment. Outside these two variables, the observed skewness values are generally modest, with several columns exhibiting distributions close to symmetric.

Parameters:

{
  "max_threshold": 1
}
            

Tables

Skewness Results for Dataset

Column Skewness Pass/Fail
CreditScore -0.0620 Pass
Age 1.0245 Fail
Tenure 0.0077 Pass
Balance -0.1353 Pass
NumOfProducts 0.7172 Pass
HasCrCard -0.8867 Pass
IsActiveMember -0.0796 Pass
EstimatedSalary 0.0095 Pass
Exited 1.4847 Fail
2026-05-26 22:17:45,562 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
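As a rough illustration of the pass/fail rule reported above, the sketch below computes sample skewness in plain Python and compares its absolute value against `max_threshold`. It assumes the bias-adjusted Fisher-Pearson estimator (the one pandas uses for `skew()`) and an absolute-value comparison. ValidMind's own implementation may differ, and the helper names `sample_skewness` and `skewness_check` are hypothetical.

```python
import math


def sample_skewness(values):
    """Bias-adjusted Fisher-Pearson sample skewness (assumed estimator)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1)
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        return 0.0
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


def skewness_check(columns, max_threshold=1.0):
    """Return {column: (skewness, passed)}; passing means |skewness| < threshold."""
    results = {}
    for name, values in columns.items():
        skew = sample_skewness(values)
        results[name] = (skew, abs(skew) < max_threshold)
    return results


# Toy data: one symmetric column and one imbalanced binary column
toy = {
    "symmetric": [1, 2, 3, 4, 5],
    "imbalanced_flag": [0] * 8 + [1] * 2,
}
results = skewness_check(toy, max_threshold=1.0)
```

A binary column with 20% ones has a population skewness of 1.5, so a value near 1.48 for a 0/1 target is about what an imbalanced binary column would be expected to produce under this estimator.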
validmind.data_validation.UniqueRows:raw_data

❌ Unique Rows Raw Data

The UniqueRows test evaluates dataset diversity by comparing the proportion of unique values in each column against the configured minimum threshold of 1%. The result table reports the number and percentage of unique values for 11 columns and assigns a pass or fail outcome for each column. Three columns exceed the threshold and pass the test, while the remaining eight columns fall below the threshold and fail. Reported uniqueness ranges from 0.025% to 100.0% across the evaluated columns.

Key insights:

  • Uniqueness is concentrated in three columns: EstimatedSalary records 8,000 unique values and 100.0% uniqueness, Balance records 5,088 unique values and 63.6% uniqueness, and CreditScore records 452 unique values and 5.65% uniqueness. These are the only columns that pass the 1% threshold.

  • Most columns fall below threshold: Eight of the 11 evaluated columns fail the test, including Geography, Gender, Age, Tenure, NumOfProducts, HasCrCard, IsActiveMember, and Exited. Their uniqueness percentages range from 0.025% to 0.8625%, all below the configured minimum.

  • Several variables have very limited distinct values: Gender, HasCrCard, IsActiveMember, and Exited each contain 2 unique values corresponding to 0.025% uniqueness, while Geography contains 3 unique values and NumOfProducts contains 4 unique values. These columns show the lowest diversity by this test’s measurement.

  • Age is the closest failing column: Age contains 69 unique values, representing 0.8625% uniqueness. This is the highest uniqueness percentage among the failing columns, but it remains below the 1% threshold.

The results show a mixed uniqueness profile across the dataset under the configured threshold. High uniqueness is observed in EstimatedSalary, Balance, and CreditScore, while most remaining columns exhibit low distinct-value counts relative to the total row count and therefore fail the test. Overall, the test outcome is driven by a small number of highly unique continuous variables alongside a majority of columns with limited observed distinct values.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

Column Number of Unique Values Percentage of Unique Values (%) Pass/Fail
CreditScore 452 5.6500 Pass
Geography 3 0.0375 Fail
Gender 2 0.0250 Fail
Age 69 0.8625 Fail
Tenure 11 0.1375 Fail
Balance 5088 63.6000 Pass
NumOfProducts 4 0.0500 Fail
HasCrCard 2 0.0250 Fail
IsActiveMember 2 0.0250 Fail
EstimatedSalary 8000 100.0000 Pass
Exited 2 0.0250 Fail
2026-05-26 22:17:57,638 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
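The percentages in the table above follow from dividing each column's distinct-value count by the row count. The sketch below shows that calculation in plain Python. The function name `unique_value_report` and the strict "greater than" pass rule are assumptions for illustration, not the library's confirmed behavior.

```python
def unique_value_report(columns, min_percent_threshold=1.0):
    """Return {column: (n_unique, percent_unique, passed)} for each column."""
    report = {}
    for name, values in columns.items():
        n_unique = len(set(values))
        percent_unique = 100.0 * n_unique / len(values)
        # Assumed rule: a column passes when its percentage exceeds the threshold
        passed = percent_unique > min_percent_threshold
        report[name] = (n_unique, percent_unique, passed)
    return report


# Toy data: an identifier-like column and a binary flag, 400 rows each
toy = {"account_id": list(range(400)), "flag": [0, 1] * 200}
report = unique_value_report(toy)

# Arithmetic behind two rows of the table above (8,000 rows in the raw dataset)
credit_score_pct = 100.0 * 452 / 8000  # CreditScore: 452 distinct values
age_pct = 100.0 * 69 / 8000            # Age: 69 distinct values
```

Under this measure, a binary or low-cardinality column will fall below a 1% threshold in any dataset with more than a few hundred rows, which matches the pattern of failures reported for this test.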
validmind.data_validation.TooManyZeroValues:raw_data

❌ Too Many Zero Values Raw Data

The TooManyZeroValues test evaluates numerical columns for excessive concentrations of zero values relative to a specified threshold. In this run, the threshold was set to 0.03%, and the result table reports zero-value counts, percentages, and pass/fail outcomes for each assessed numerical variable. Four variables are listed in the output, each with 8,000 rows, and all four exceed the configured threshold. Reported zero-value percentages range from 4.0375% to 48.0125%.

Key insights:

  • All assessed variables failed: Tenure, Balance, HasCrCard, and IsActiveMember each exceeded the 0.03% threshold and were marked as Fail in the test output.
  • IsActiveMember has the highest zero concentration: IsActiveMember contains 3,841 zero values out of 8,000 rows, corresponding to 48.0125%, the largest share among the assessed variables.
  • Balance also shows substantial zeros: Balance contains 2,912 zero values, representing 36.4% of observations, indicating a large zero-valued segment in this field.
  • Zero prevalence varies materially across variables: Reported zero-value percentages span from 4.0375% for Tenure to 48.0125% for IsActiveMember, showing notable dispersion in zero concentration across the numerical fields reviewed.
  • Tenure is lowest but still above threshold: Tenure has 323 zero values, equal to 4.0375% of rows, which is the smallest proportion in the table but remains above the configured limit.

The test result shows that every numerical variable included in this output exceeded the configured zero-value threshold of 0.03%. Zero concentrations are highest in IsActiveMember and Balance, while Tenure has the lowest observed zero share. Collectively, the results indicate that zero values are present at nontrivial levels across all assessed numerical fields in the raw dataset.

Parameters:

{
  "max_percent_threshold": 0.03
}
            

Tables

Variable Row Count Number of Zero Values Percentage of Zero Values (%) Pass/Fail
Tenure 8000 323 4.0375 Fail
Balance 8000 2912 36.4000 Fail
HasCrCard 8000 2379 29.7375 Fail
IsActiveMember 8000 3841 48.0125 Fail
2026-05-26 22:18:05,159 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
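To check the table above by hand, the following sketch recomputes the zero-value percentages from the reported counts. It assumes the threshold is compared directly against a percentage on the 0 to 100 scale, as the description above states. The helper `zero_value_report` is hypothetical and is not part of the ValidMind Library.

```python
def zero_value_report(zero_counts, row_count, max_percent_threshold=0.03):
    """Return {variable: (percent_zero, 'Pass' or 'Fail')} from zero counts."""
    report = {}
    for name, n_zeros in zero_counts.items():
        percent_zero = 100.0 * n_zeros / row_count
        # Assumed rule: fail when the zero percentage is at or above the threshold
        outcome = "Pass" if percent_zero < max_percent_threshold else "Fail"
        report[name] = (round(percent_zero, 4), outcome)
    return report


# Counts copied from the table above
counts = {"Tenure": 323, "Balance": 2912, "HasCrCard": 2379, "IsActiveMember": 3841}
report = zero_value_report(counts, row_count=8000)
```

With 8,000 rows, a 0.03% limit corresponds to fewer than three zero-valued rows, so binary indicators such as `HasCrCard` and `IsActiveMember` cannot pass under this setting.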
validmind.data_validation.IQROutliersTable:raw_data

IQR Outliers Table Raw Data

The Interquartile Range Outliers Table test evaluates numerical features for observations falling outside the IQR-based outlier bounds. The result is presented as a summary table of outliers detected by the IQR method using a threshold parameter of 5. In this execution, the raw result table titled Summary of Outliers Detected by IQR Method contains no rows, indicating that no feature-level outlier summaries are reported in the output.

Key insights:

  • No outlier records reported: The outlier summary table is empty, with no numerical features listed and no associated outlier counts or summary statistics shown.
  • No feature-level outlier detail available: Because the table contains no entries, no minimum, quartile, median, or maximum values for outlier observations are present in the result.

The test output shows no reported outlier summaries in the generated IQR outlier table. Based on the provided result, the execution does not display any feature-level evidence of detected outliers or any associated descriptive statistics for outlier values.

Parameters:

{
  "threshold": 5
}
            

Tables

Summary of Outliers Detected by IQR Method

2026-05-26 22:18:08,959 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
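The IQR method flags values that fall outside the fences Q1 - k * IQR and Q3 + k * IQR, where k is the threshold. The sketch below is a minimal plain-Python version of that rule, assuming linearly interpolated quartiles. The function name `iqr_outliers` is hypothetical, and the library's quartile method may differ.

```python
import statistics


def iqr_outliers(values, threshold=1.5):
    """Return the values lying outside the IQR fences for the given threshold."""
    # "inclusive" uses linear interpolation between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return [v for v in values if v < lower or v > upper]


# Toy data: the integers 1 to 20 plus one distant value
sample = list(range(1, 21)) + [60]

conventional = iqr_outliers(sample, threshold=1.5)  # narrower fences
wide = iqr_outliers(sample, threshold=5)            # fences used in this run
```

In this toy example the distant value is flagged at the conventional threshold of 1.5 but not at 5. A wide threshold may be why the table above has no rows, although an empty table could have other causes.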
validmind.data_validation.DescriptiveStatistics:preprocessed_data

Descriptive Statistics Preprocessed Data

The Descriptive Statistics test evaluates the distributional characteristics of numerical and categorical variables in the preprocessed dataset. The results present summary statistics for seven numerical variables and frequency-based summaries for two categorical variables, each based on 3,232 observations. Numerical outputs include central tendency, dispersion, and percentile ranges, while categorical outputs show the number of unique values, the most frequent category, and its share of the sample.

Key insights:

  • No missingness in reported fields: All numerical and categorical variables show a count of 3,232, indicating complete observations across the reported preprocessed fields.
  • Balance shows pronounced lower-end concentration: Balance has a mean of 81,006.3706, a median of 103,081.0, and a 25th percentile of 0.0, indicating a substantial concentration of observations at zero or near-zero values alongside a broad upper range extending to 250,898.0.
  • EstimatedSalary is broadly dispersed but centered: EstimatedSalary has a mean of 100,150.5317 and a median of 100,502.0, with a standard deviation of 57,381.2377 and values spanning from 12.0 to 199,992.0, showing wide spread with closely aligned central tendency measures.
  • CreditScore is broadly distributed across its range: CreditScore ranges from 350.0 to 850.0, with a mean of 650.1624, median of 652.0, and standard deviation of 98.4398, indicating broad dispersion with mean and median closely aligned.
  • NumOfProducts is concentrated in lower values: NumOfProducts has a median of 1.0, a 75th percentile of 2.0, and a maximum of 4.0, showing that most observations are concentrated in the lower part of the available range.
  • Categorical distributions are moderately concentrated: Geography contains 3 unique values, with France as the most common category at 46.47%, while Gender contains 2 unique values, with Male at 51.76%, indicating neither categorical field is dominated by a single category to an extreme degree.

The descriptive profile shows complete reported coverage across all summarized fields and a mix of distribution shapes across variables. CreditScore and EstimatedSalary have mean and median values that are closely aligned despite broad ranges, while Balance stands out for its lower-end concentration and wide dispersion. The categorical variables exhibit limited cardinality but relatively balanced leading-category shares, with the largest category remaining below half of the sample in Geography and only slightly above half in Gender.

Tables

Numerical Variables

Name Count Mean Std Min 25% 50% 75% 90% 95% Max
CreditScore 3232.0 650.1624 98.4398 350.0 583.0 652.0 718.0 779.0 816.0 850.0
Tenure 3232.0 5.0108 2.9064 0.0 3.0 5.0 8.0 9.0 10.0 10.0
Balance 3232.0 81006.3706 61459.7618 0.0 0.0 103081.0 128564.0 149153.0 162786.0 250898.0
NumOfProducts 3232.0 1.5195 0.6710 1.0 1.0 1.0 2.0 2.0 3.0 4.0
HasCrCard 3232.0 0.6977 0.4593 0.0 0.0 1.0 1.0 1.0 1.0 1.0
IsActiveMember 3232.0 0.4567 0.4982 0.0 0.0 0.0 1.0 1.0 1.0 1.0
EstimatedSalary 3232.0 100150.5317 57381.2377 12.0 50878.0 100502.0 149432.0 179355.0 190176.0 199992.0

Categorical Variables

Name Count Number of Unique Values Top Value Top Value Frequency Top Value Frequency %
Geography 3232.0 3.0 France 1502.0 46.47
Gender 3232.0 2.0 Male 1673.0 51.76
2026-05-26 22:18:17,227 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
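The numerical table above reports count, mean, standard deviation, and selected percentiles. The sketch below reproduces that layout for a single column using only the standard library, assuming linearly interpolated percentiles and a sample standard deviation. The function name `describe_column` is hypothetical.

```python
import statistics


def describe_column(values):
    """Summary statistics in the same layout as the numerical table above."""
    # n=20 yields cut points at 5%, 10%, ..., 95%; cuts[k] is the (k + 1) * 5% point
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return {
        "Count": len(values),
        "Mean": statistics.fmean(values),
        "Std": statistics.stdev(values),
        "Min": min(values),
        "25%": cuts[4],
        "50%": cuts[9],
        "75%": cuts[14],
        "90%": cuts[17],
        "95%": cuts[18],
        "Max": max(values),
    }


# Toy zero-heavy column: 30% zeros, 70% at a single positive value
zero_heavy = [0] * 30 + [100] * 70
summary = describe_column(zero_heavy)
```

In the toy column the 25th percentile is zero and the mean sits below the median, the same pattern the Balance row shows in the table above.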
validmind.data_validation.TabularDescriptionTables:preprocessed_data

Tabular Description Tables Preprocessed Data

The Tabular Description Tables test evaluates the composition, completeness, and basic distributional properties of the preprocessed dataset. The results summarize numerical variables by observation count, mean, minimum, maximum, missingness, and data type, and summarize categorical variables by observation count, number of unique values, listed categories, missingness, and data type. The reported output covers eight numerical variables and two categorical variables, with each field showing 3,232 observations and associated summary statistics.

Key insights:

  • No missing values reported: All numerical and categorical variables show 0.0% missing values across the 3,232 observations in the preprocessed dataset.
  • Consistent observation counts across fields: Each reported variable, including CreditScore, Balance, EstimatedSalary, Geography, and Gender, has 3,232 observations, indicating a uniform row count across the summarized dataset.
  • Binary indicators are numerically encoded: HasCrCard, IsActiveMember, and Exited are stored as int64 with minimum 0.0 and maximum 1.0. Their means of 0.6977, 0.4567, and 0.5 respectively indicate the average proportion of records taking the value 1.
  • Exited is evenly distributed: The target variable Exited has mean 0.5 with values ranging from 0.0 to 1.0, indicating an even split between the two classes in the summarized data.
  • Categorical cardinality is low: Geography contains 3 unique values (Spain, France, Germany) and Gender contains 2 unique values (Female, Male), with both variables stored as object.
  • Numeric ranges vary substantially across variables: CreditScore ranges from 350.0 to 850.0, Tenure from 0.0 to 10.0, Balance from 0.0 to 250,898.09, and EstimatedSalary from 11.58 to 199,992.48, showing materially different scales across the numerical inputs.

The descriptive statistics indicate that the preprocessed dataset is complete across all reported fields and consistently populated at 3,232 observations per variable. The summarized structure includes a mix of continuous numerical fields, integer-count variables, binary encoded indicators, and low-cardinality categorical variables. The target variable is balanced in the reported sample, and the numerical fields span markedly different value ranges, which characterizes the scale profile of the dataset captured by this test.

Tables

Numerical Variable Num of Obs Mean Min Max Missing Values (%) Data Type
CreditScore 3232 650.1624 350.00 850.00 0.0 int64
Tenure 3232 5.0108 0.00 10.00 0.0 int64
Balance 3232 81006.3706 0.00 250898.09 0.0 float64
NumOfProducts 3232 1.5195 1.00 4.00 0.0 int64
HasCrCard 3232 0.6977 0.00 1.00 0.0 int64
IsActiveMember 3232 0.4567 0.00 1.00 0.0 int64
EstimatedSalary 3232 100150.5317 11.58 199992.48 0.0 float64
Exited 3232 0.5000 0.00 1.00 0.0 int64

Categorical Variable Num of Obs Num of Unique Values Unique Values Missing Values (%) Data Type
Geography 3232.0 3.0 ['Spain' 'France' 'Germany'] 0.0 object
Gender 3232.0 2.0 ['Female' 'Male'] 0.0 object
2026-05-26 22:18:25,476 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
validmind.data_validation.MissingValues:preprocessed_data

✅ Missing Values Preprocessed Data

The Missing Values test evaluates dataset completeness by measuring the percentage of missing values in each feature against the configured threshold of 1.0%. The results table lists 10 columns in the preprocessed dataset and reports the count and percentage of missing values for each column, together with a pass/fail outcome. All reported features show 0 missing values and 0.0% missingness, with each column marked as Pass.

Key insights:

  • No missing values detected: All 10 evaluated columns report 0 missing values, corresponding to 0.0% missingness in every case.
  • All features passed the threshold: Every column satisfies the 1.0% missing-value threshold, with no failures recorded in the results table.
  • Completeness is uniform across variables: Missingness results are identical across all listed features, including CreditScore, Geography, Gender, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited.

The test results indicate complete observed data coverage across the evaluated preprocessed dataset. No column exhibits missingness, and no feature exceeds the configured threshold. The observed outcome is consistent across all listed predictors and the target variable Exited.

Parameters:

{
  "min_percentage_threshold": 1
}
            

Tables

Column Number of Missing Values Percentage of Missing Values (%) Pass/Fail
CreditScore 0 0.0 Pass
Geography 0 0.0 Pass
Gender 0 0.0 Pass
Tenure 0 0.0 Pass
Balance 0 0.0 Pass
NumOfProducts 0 0.0 Pass
HasCrCard 0 0.0 Pass
IsActiveMember 0 0.0 Pass
EstimatedSalary 0 0.0 Pass
Exited 0 0.0 Pass
2026-05-26 22:18:30,144 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document
validmind.data_validation.TabularNumericalHistograms:preprocessed_data

Tabular Numerical Histograms Preprocessed Data

The TabularNumericalHistograms test evaluates the distribution of numerical input features by plotting a histogram for each variable. The results show distinct distribution shapes across the preprocessed dataset, including continuous variables with broad numeric ranges, discrete count-like variables, and binary indicators. The figures cover CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, allowing direct visual comparison of concentration, spread, and tail behavior for each feature.

Key insights:

  • CreditScore is broadly bell-shaped: CreditScore is distributed across roughly the 350 to 850 range, with the highest concentration in the mid-range around 600 to 700. Counts taper on both lower and upper ends, indicating thinner tails relative to the center.

  • Tenure is nearly uniform across categories: Tenure values from 1 through 9 appear at broadly similar frequencies, while 0 and 10 have visibly lower counts than the interior categories. This indicates a mostly even discrete distribution with reduced mass at the endpoints.

  • Balance shows a zero-heavy pattern: Balance has a very large spike at 0, followed by a separate concentration across positive balances centered roughly around 100k to 150k. The positive-balance portion is spread over a wide range and extends with a thinning right tail toward higher values.

  • NumOfProducts is concentrated in low counts: NumOfProducts is discrete and heavily concentrated at 1 and 2, with 1 occurring most frequently. Values of 3 are much less common, and 4 appears only rarely.

  • Binary indicators are imbalanced to different degrees: HasCrCard is skewed toward the value 1, while IsActiveMember is more balanced but still shows a higher count at 0 than at 1. Both variables are strictly concentrated at the two binary endpoints.

  • EstimatedSalary is approximately uniform: EstimatedSalary is spread fairly evenly across the range from near 0 to 200k, with bar heights that remain relatively consistent across bins. No strong central mode or pronounced skew is visible in this feature.

The histogram results show that the numerical inputs are heterogeneous in form rather than following a common distributional pattern. Several variables are discrete or binary, Balance exhibits a pronounced mass at zero alongside a broad positive distribution, and CreditScore is the clearest example of a unimodal continuous feature. Overall, the plots indicate a mix of approximately uniform, centrally concentrated, and highly uneven distributions across the modeled inputs.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:6c67
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:4620
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:3ae1
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:0e24
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:9e84
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:8924
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:077d
2026-05-26 22:19:02,206 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document
validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data

Tabular Categorical Bar Plots Preprocessed Data

The TabularCategoricalBarPlots test evaluates the composition of categorical features by plotting the count of observations in each category. The result includes bar plots for two categorical variables in the preprocessed dataset: Geography and Gender. The Geography plot shows counts for France, Germany, and Spain, while the Gender plot shows counts for Male and Female, allowing direct visual comparison of category representation across these features.

Key insights:

  • France is the largest geography: France has the highest count among the Geography categories at approximately 1,500 observations, compared with about 1,000 for Germany and about 700 for Spain.
  • Geography distribution is uneven: The Geography feature is not evenly distributed across categories, with a visible decline in counts from France to Germany to Spain.
  • Gender counts are relatively balanced: The Gender feature shows similar representation across categories, with Male at approximately 1,700 observations and Female at approximately 1,550 observations.
  • Limited category cardinality: Both categorical features contain a small number of categories in the displayed plots, with three categories for Geography and two for Gender.

The categorical composition shown in the plots indicates differing distribution patterns across the two features. Geography exhibits a clear concentration in France relative to Germany and Spain, whereas Gender is comparatively balanced between Male and Female. The displayed categorical variables also have low cardinality, making the category distributions straightforward to interpret from the plots.

Figures

ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:7dd0
ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:ee9b
2026-05-26 22:19:26,895 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document
validmind.data_validation.TargetRateBarPlots:preprocessed_data

Target Rate Bar Plots Preprocessed Data

The Target Rate Bar Plots test evaluates categorical feature distributions alongside their observed target rates. The results present paired bar charts for the categorical variables Geography and Gender, with one chart showing category counts and the other showing the corresponding mean target rate for each category. For Geography, the categories shown are France, Germany, and Spain; for Gender, the categories shown are Male and Female. The visualizations allow direct comparison of category prevalence and target-rate differences within each feature.

Key insights:

  • Germany shows the highest target rate: Within Geography, Germany has the highest observed target rate at approximately 0.65, compared with about 0.42 for France and about 0.45 for Spain.
  • France is the largest geography segment: France has the highest category count at roughly 1,500 observations, followed by Germany at about 1,000 and Spain at about 750.
  • Female category has higher target rate: Within Gender, Female shows a higher target rate at approximately 0.57, while Male is lower at approximately 0.43.
  • Gender counts are relatively balanced: The count distribution across Gender is close, with Male at roughly 1,650 observations and Female at roughly 1,550.

The results show clear variation in observed target rates across both categorical features. The largest difference appears within Geography, where Germany has a materially higher target rate than France and Spain despite not being the largest category by count. Gender displays a more balanced count distribution, with Female exhibiting a higher target rate than Male. Overall, the plots indicate that both categorical variables differentiate the observed target outcome, with stronger separation visible across geographic categories.

Figures

ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:5fa4
ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:2c47
2026-05-26 22:19:43,331 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
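The target rate shown in these plots is the mean of the binary target within each category. The following sketch computes category counts and target rates in plain Python. The function name `target_rates` and the toy data are illustrative assumptions, not values from the dataset.

```python
from collections import defaultdict


def target_rates(categories, targets):
    """Return {category: {'count': n, 'target_rate': mean of target}}."""
    totals = defaultdict(lambda: [0, 0])  # category -> [row count, sum of target]
    for category, target in zip(categories, targets):
        totals[category][0] += 1
        totals[category][1] += target
    return {
        category: {"count": n, "target_rate": positives / n}
        for category, (n, positives) in totals.items()
    }


# Toy data only; these are not rows from the preprocessed dataset
geography = ["France", "France", "Germany", "Germany", "Spain"]
exited = [0, 1, 1, 1, 0]
rates = target_rates(geography, exited)
```

Reading the count and the rate together helps separate a category that is merely large from one whose members churn more often.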
validmind.data_validation.DescriptiveStatistics:development_data

Descriptive Statistics Development Data

The Descriptive Statistics test evaluates the distributional characteristics of numerical variables in the development dataset and corresponding test split. The results provide summary statistics for seven variables across train_dataset_final and test_dataset_final, including counts, central tendency, dispersion, and selected quantiles. Reported sample sizes are 2,585 observations for the training dataset and 647 observations for the test dataset, with statistics shown separately for each variable in each split.

Key insights:

  • Train and test central tendencies are closely aligned: Means and medians are very similar between training and test datasets across all reported variables. For example, CreditScore has means of 650.40 and 649.21 with medians of 652 and 650, while EstimatedSalary has means of 100,166.22 and 100,087.85 with medians of 100,862 and 97,893.

  • Balance shows the widest dispersion: Balance has the largest standard deviation in both datasets at 61,340.02 in training and 61,914.02 in test. Its minimum and 25th percentile are both 0, while its maximum reaches 250,898 in training and 216,110 in test, indicating a broad value range.

  • Balance includes a substantial zero-valued segment: For Balance, both datasets have a minimum of 0 and a 25th percentile of 0, while the median exceeds 100,000 in each split. This indicates that at least one quarter of observations are at zero, with the distribution shifting sharply above zero beyond the lower quartile.

  • Binary features show stable proportions: HasCrCard and IsActiveMember display means near binary class proportions and remain close across splits. HasCrCard is 0.6975 in training and 0.6986 in test, while IsActiveMember is 0.4603 in training and 0.4420 in test.

  • CreditScore and Tenure are broadly symmetric: For CreditScore, the mean and median are closely matched in both datasets (650.40 vs. 652.0 in training; 649.21 vs. 650.0 in test). Tenure shows the same pattern with means near 5 and medians of 5 in both splits, alongside similar interquartile ranges.

  • NumOfProducts is concentrated at low values: NumOfProducts has a median of 1 and a 75th percentile of 2 in both datasets, with means of 1.5180 and 1.5255. The 95th percentile is 3 and the maximum is 4 in both splits, showing limited spread and concentration in the lower product counts.

The descriptive statistics indicate that the training and test splits are closely matched across the reported numerical variables, with similar counts, central tendencies, quantiles, and dispersion measures. The most pronounced distributional feature is the wide spread in Balance, including a zero-valued lower quartile and substantially higher values above the median. Other variables, including CreditScore, Tenure, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary, exhibit generally consistent distributions between the two datasets.

Tables

dataset Name Count Mean Std Min 25% 50% 75% 90% 95% Max
train_dataset_final CreditScore 2585.0 650.3996 98.1793 350.0 584.0 652.0 719.0 779.0 816.0 850.0
train_dataset_final Tenure 2585.0 4.9822 2.9005 0.0 2.0 5.0 7.0 9.0 10.0 10.0
train_dataset_final Balance 2585.0 81663.1705 61340.0164 0.0 0.0 103561.0 129189.0 149603.0 162851.0 250898.0
train_dataset_final NumOfProducts 2585.0 1.5180 0.6704 1.0 1.0 1.0 2.0 2.0 3.0 4.0
train_dataset_final HasCrCard 2585.0 0.6975 0.4594 0.0 0.0 1.0 1.0 1.0 1.0 1.0
train_dataset_final IsActiveMember 2585.0 0.4603 0.4985 0.0 0.0 0.0 1.0 1.0 1.0 1.0
train_dataset_final EstimatedSalary 2585.0 100166.2200 57127.9094 123.0 51752.0 100862.0 149097.0 178752.0 189649.0 199992.0
test_dataset_final CreditScore 647.0 649.2148 99.5445 376.0 578.0 650.0 716.0 781.0 826.0 850.0
test_dataset_final Tenure 647.0 5.1252 2.9292 0.0 3.0 5.0 8.0 9.0 10.0 10.0
test_dataset_final Balance 647.0 78382.2164 61914.0155 0.0 0.0 101629.0 126702.0 147107.0 161805.0 216110.0
test_dataset_final NumOfProducts 647.0 1.5255 0.6738 1.0 1.0 1.0 2.0 2.0 3.0 4.0
test_dataset_final HasCrCard 647.0 0.6986 0.4592 0.0 0.0 1.0 1.0 1.0 1.0 1.0
test_dataset_final IsActiveMember 647.0 0.4420 0.4970 0.0 0.0 0.0 1.0 1.0 1.0 1.0
test_dataset_final EstimatedSalary 647.0 100087.8508 58427.1517 12.0 48353.0 97893.0 151129.0 181116.0 190539.0 199418.0
2026-05-26 22:19:57,596 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
validmind.data_validation.TabularDescriptionTables:development_data

Tabular Description Tables Development Data

The Tabular Description Tables test evaluates the structure, completeness, and basic distributional properties of variables in the development data. The results summarize numerical and categorical fields for both train_dataset_final and test_dataset_final, including observation counts, central tendency, value ranges, missingness, uniqueness, and recorded data types. The numerical summary covers eight variables in each dataset, while the categorical summary covers three boolean indicator variables in each dataset. Across the reported fields, the tables provide a side-by-side view of train and test sample characteristics.

Key insights:

  • No reported missing values: All numerical and categorical variables show Missing Values (%) = 0.0 in both train_dataset_final and test_dataset_final, indicating complete data across the fields included in this summary.

  • Train and test feature ranges are closely aligned: Several numerical variables share identical or near-identical observed ranges across datasets, including Tenure (0 to 10 in both), NumOfProducts (1 to 4 in both), HasCrCard (0 to 1 in both), IsActiveMember (0 to 1 in both), and Exited (0 to 1 in both). CreditScore also shows a similar upper bound of 850 in both datasets, with minimum values of 350 in train and 376 in test.

  • Average values are broadly similar across splits: Mean values for several numerical variables are close between train and test, including CreditScore (650.3996 vs. 649.2148), Tenure (4.9822 vs. 5.1252), NumOfProducts (1.518 vs. 1.5255), HasCrCard (0.6975 vs. 0.6986), and EstimatedSalary (100166.22 vs. 100087.8508). This indicates limited separation in central tendency for these fields between the two datasets.

  • Balance and target means differ modestly: Balance has a mean of 81,663.1705 in the training dataset and 78,382.2164 in the test dataset. The target variable Exited has a mean of 0.4944 in train and 0.5224 in test, showing a higher average event rate in the test sample.

  • Categorical fields are binary boolean indicators: The categorical variables Geography_Germany, Geography_Spain, and Gender_Male each have exactly 2 unique values in both datasets, with unique values reported as True and False and data type bool.

  • Dataset sizes differ by split: The numerical and categorical summaries report 2,585 observations for train_dataset_final and 647 observations for test_dataset_final, establishing the relative sample sizes used in the development data partition.

The descriptive statistics show complete coverage of the reported development variables with no missing values in either split. Train and test datasets exhibit closely aligned ranges and mean values across most numerical fields, while Balance and Exited show modest differences in average levels between splits. The categorical variables are consistently encoded as binary boolean indicators, and the tabulated results present a structurally consistent representation of the development data across training and test samples.

Tables

dataset | Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type
train_dataset_final | CreditScore | 2585 | 650.3996 | 350.00 | 850.00 | 0.0 | int64
train_dataset_final | Tenure | 2585 | 4.9822 | 0.00 | 10.00 | 0.0 | int64
train_dataset_final | Balance | 2585 | 81663.1705 | 0.00 | 250898.09 | 0.0 | float64
train_dataset_final | NumOfProducts | 2585 | 1.5180 | 1.00 | 4.00 | 0.0 | int64
train_dataset_final | HasCrCard | 2585 | 0.6975 | 0.00 | 1.00 | 0.0 | int64
train_dataset_final | IsActiveMember | 2585 | 0.4603 | 0.00 | 1.00 | 0.0 | int64
train_dataset_final | EstimatedSalary | 2585 | 100166.2200 | 123.07 | 199992.48 | 0.0 | float64
train_dataset_final | Exited | 2585 | 0.4944 | 0.00 | 1.00 | 0.0 | int64
test_dataset_final | CreditScore | 647 | 649.2148 | 376.00 | 850.00 | 0.0 | int64
test_dataset_final | Tenure | 647 | 5.1252 | 0.00 | 10.00 | 0.0 | int64
test_dataset_final | Balance | 647 | 78382.2164 | 0.00 | 216109.88 | 0.0 | float64
test_dataset_final | NumOfProducts | 647 | 1.5255 | 1.00 | 4.00 | 0.0 | int64
test_dataset_final | HasCrCard | 647 | 0.6986 | 0.00 | 1.00 | 0.0 | int64
test_dataset_final | IsActiveMember | 647 | 0.4420 | 0.00 | 1.00 | 0.0 | int64
test_dataset_final | EstimatedSalary | 647 | 100087.8508 | 11.58 | 199418.02 | 0.0 | float64
test_dataset_final | Exited | 647 | 0.5224 | 0.00 | 1.00 | 0.0 | int64

dataset | Categorical Variable | Num of Obs | Num of Unique Values | Unique Values | Missing Values (%) | Data Type
train_dataset_final | Geography_Germany | 2585.0 | 2.0 | [False True] | 0.0 | bool
train_dataset_final | Geography_Spain | 2585.0 | 2.0 | [False True] | 0.0 | bool
train_dataset_final | Gender_Male | 2585.0 | 2.0 | [True False] | 0.0 | bool
test_dataset_final | Geography_Germany | 647.0 | 2.0 | [False True] | 0.0 | bool
test_dataset_final | Geography_Spain | 647.0 | 2.0 | [False True] | 0.0 | bool
test_dataset_final | Gender_Male | 647.0 | 2.0 | [False True] | 0.0 | bool
2026-05-26 22:20:08,535 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document
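
To put a number on how far the train and test means sit apart, the sketch below computes the relative change in each reported mean. The means are copied from the numerical summary table above. The `relative_shift` helper and the choice of a percentage-of-train-mean measure are illustrative assumptions, not something the TabularDescriptionTables test reports itself.

```python
# Means copied from the numerical summary table above: (train, test).
reported_means = {
    "CreditScore": (650.3996, 649.2148),
    "Tenure": (4.9822, 5.1252),
    "Balance": (81663.1705, 78382.2164),
    "NumOfProducts": (1.5180, 1.5255),
    "HasCrCard": (0.6975, 0.6986),
    "IsActiveMember": (0.4603, 0.4420),
    "EstimatedSalary": (100166.2200, 100087.8508),
    "Exited": (0.4944, 0.5224),
}


def relative_shift(train_mean: float, test_mean: float) -> float:
    """Change in the mean from train to test, as a percentage of the train mean."""
    return (test_mean - train_mean) / train_mean * 100


shifts = {name: relative_shift(*means) for name, means in reported_means.items()}

# List variables from the largest to the smallest absolute shift.
for name, shift in sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{name:<16} {shift:+.2f}%")
```

On the reported means, Exited shifts by about +5.7% and Balance by about -4.0%, while CreditScore moves by about -0.2%. IsActiveMember, which the summary does not single out, also shifts by about -4.0%.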
validmind.data_validation.ClassImbalance:development_data

✅ Class Imbalance Development Data

The Class Imbalance test evaluates the distribution of target classes in the development data by measuring the percentage of records in each class and comparing those percentages to the configured minimum threshold of 10%. Results are reported separately for train_dataset_final and test_dataset_final for the target variable Exited. In the training dataset, class Exited=0 represents 50.56% of rows and Exited=1 represents 49.44%, while in the test dataset, class Exited=1 represents 52.24% and Exited=0 represents 47.76%. All class proportions are marked as Pass under the configured threshold criterion.

Key insights:

  • Both classes exceed threshold: Every observed class proportion is well above the 10% minimum threshold in both datasets, and all class-level test results are recorded as Pass.
  • Training data is nearly balanced: In train_dataset_final, the class split is 50.56% for Exited=0 and 49.44% for Exited=1, a difference of 1.12 percentage points.
  • Test data remains balanced: In test_dataset_final, the class split is 52.24% for Exited=1 and 47.76% for Exited=0, a difference of 4.48 percentage points.
  • Class ordering differs across splits: The slight majority class is Exited=0 in the training dataset, while Exited=1 is the slight majority class in the test dataset.

The test results show that the target variable distribution is balanced across both training and test development datasets under the configured 10% minimum class threshold. Class proportions in both splits remain close to an even 50/50 distribution, with only small differences between the two classes. The observed split-level variation is limited to a reversal in the slight majority class between train and test, while all classes continue to satisfy the test criterion.

Parameters:

{
  "min_percent_threshold": 10
}
            

Tables

dataset | Exited | Percentage of Rows (%) | Pass/Fail
train_dataset_final | 0 | 50.56% | Pass
train_dataset_final | 1 | 49.44% | Pass
test_dataset_final | 1 | 52.24% | Pass
test_dataset_final | 0 | 47.76% | Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:development_data:6b88
ValidMind Figure validmind.data_validation.ClassImbalance:development_data:b3b7
2026-05-26 22:20:22,731 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
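
The pass/fail logic described for the Class Imbalance test can be sketched in a few lines of plain Python. This is a minimal illustration, not the ValidMind implementation: the `class_imbalance` helper is hypothetical, the strict "greater than" comparison is an assumption, and the label vectors are rebuilt from the reported percentages and row counts rather than taken from the actual data.

```python
from collections import Counter


def class_imbalance(labels, min_percent_threshold=10):
    """Return {class: (percentage_of_rows, passed)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        # Assumption: a class passes when its share is strictly above the threshold.
        cls: (round(100 * n / total, 2), 100 * n / total > min_percent_threshold)
        for cls, n in counts.items()
    }


# Hypothetical label vectors: class counts are back-calculated from the reported
# percentages and row counts (2,585 train rows, 647 test rows).
train_labels = [0] * 1307 + [1] * 1278
test_labels = [1] * 338 + [0] * 309

train_result = class_imbalance(train_labels)
test_result = class_imbalance(test_labels)

for name, result in [("train", train_result), ("test", test_result)]:
    for cls, (pct, passed) in result.items():
        print(f"{name:<5} Exited={cls}  {pct:.2f}%  {'Pass' if passed else 'Fail'}")
```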
validmind.data_validation.UniqueRows:development_data

❌ Unique Rows Development Data

The UniqueRows test evaluates column-level data diversity by comparing the percentage of unique values in each column against the minimum threshold of 1%. Results are reported for both train_dataset_final and test_dataset_final, with each column showing its number of unique values, percentage of unique values, and pass/fail outcome. In the training dataset, 3 of 11 columns pass the threshold, while in the test dataset, 4 of 11 columns pass. Reported uniqueness percentages range from 0.0774% to 100.0% across the two datasets.

Key insights:

  • EstimatedSalary is fully unique: EstimatedSalary records 2,585 unique values in train_dataset_final and 647 unique values in test_dataset_final, corresponding to 100.0% unique values in both datasets.
  • Balance and CreditScore pass in both datasets: Balance shows 68.0464% unique values in training and 65.0696% in testing, while CreditScore shows 16.325% in training and 46.3679% in testing; both columns pass the 1% threshold in each dataset.
  • Tenure differs by dataset: Tenure has 11 unique values in both datasets, but because the percentage is computed relative to the row count (2,585 rows in training versus 647 in testing), it is 0.4255% in training and 1.7002% in testing, producing a fail in train_dataset_final and a pass in test_dataset_final.
  • Binary and low-cardinality fields fail consistently: HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, Gender_Male, and Exited each have 2 unique values and fail in both datasets, with uniqueness percentages of 0.0774% in training and 0.3091% in testing.
  • NumOfProducts remains below threshold: NumOfProducts has 4 unique values in both datasets, yielding 0.1547% unique values in training and 0.6182% in testing, which remains below the threshold in both cases.

The results show that uniqueness is concentrated in a small subset of columns, particularly EstimatedSalary, Balance, and CreditScore, with Tenure meeting the threshold only in the test dataset. Most remaining fields exhibit very low uniqueness percentages and fail the 1% threshold in both datasets, including all reported binary indicator columns and NumOfProducts. Overall, the test outcome reflects a mixed uniqueness profile across columns, with passing results limited to higher-cardinality variables.

Parameters:

{
  "min_percent_threshold": 1
}
            

Tables

dataset | Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail
train_dataset_final | CreditScore | 422 | 16.3250 | Pass
train_dataset_final | Tenure | 11 | 0.4255 | Fail
train_dataset_final | Balance | 1759 | 68.0464 | Pass
train_dataset_final | NumOfProducts | 4 | 0.1547 | Fail
train_dataset_final | HasCrCard | 2 | 0.0774 | Fail
train_dataset_final | IsActiveMember | 2 | 0.0774 | Fail
train_dataset_final | EstimatedSalary | 2585 | 100.0000 | Pass
train_dataset_final | Geography_Germany | 2 | 0.0774 | Fail
train_dataset_final | Geography_Spain | 2 | 0.0774 | Fail
train_dataset_final | Gender_Male | 2 | 0.0774 | Fail
train_dataset_final | Exited | 2 | 0.0774 | Fail
test_dataset_final | CreditScore | 300 | 46.3679 | Pass
test_dataset_final | Tenure | 11 | 1.7002 | Pass
test_dataset_final | Balance | 421 | 65.0696 | Pass
test_dataset_final | NumOfProducts | 4 | 0.6182 | Fail
test_dataset_final | HasCrCard | 2 | 0.3091 | Fail
test_dataset_final | IsActiveMember | 2 | 0.3091 | Fail
test_dataset_final | EstimatedSalary | 647 | 100.0000 | Pass
test_dataset_final | Geography_Germany | 2 | 0.3091 | Fail
test_dataset_final | Geography_Spain | 2 | 0.3091 | Fail
test_dataset_final | Gender_Male | 2 | 0.3091 | Fail
test_dataset_final | Exited | 2 | 0.3091 | Fail
2026-05-26 22:20:34,853 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
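
The uniqueness percentages in the table above follow directly from the unique-value counts and the row counts, which is why the same column can fail in one split and pass in the other. The sketch below reproduces a few of the reported figures. The helper names are hypothetical, and the strict "greater than" comparison against the threshold is an assumption about how the test decides pass/fail.

```python
def unique_percentage(n_unique: int, n_rows: int) -> float:
    """Percentage of distinct values in a column relative to the number of rows."""
    return round(100 * n_unique / n_rows, 4)


def passes(n_unique: int, n_rows: int, min_percent_threshold: float = 1) -> bool:
    # Assumption: a column passes when its percentage is strictly above the threshold.
    return unique_percentage(n_unique, n_rows) > min_percent_threshold


# Row counts and unique-value counts as reported in the table above.
n_rows = {"train_dataset_final": 2585, "test_dataset_final": 647}
n_unique = {"Tenure": 11, "NumOfProducts": 4, "Exited": 2}

for dataset, rows in n_rows.items():
    for column, unique in n_unique.items():
        pct = unique_percentage(unique, rows)
        status = "Pass" if passes(unique, rows) else "Fail"
        print(f"{dataset:<20} {column:<14} {pct:>8.4f}  {status}")
```

With a 1% threshold, a column needs at least 26 distinct values to pass on 2,585 rows but only 7 on 647 rows, so Tenure's 11 values fail in train and pass in test. Binary columns cannot pass at either sample size.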
validmind.data_validation.TabularNumericalHistograms:development_data

Tabular Numerical Histograms Development Data

The TabularNumericalHistograms test evaluates the distribution of numerical features by plotting feature-wise histograms for the dataset. The results show histograms for development data across both train_dataset_final and test_dataset_final, covering continuous variables (CreditScore, Balance, EstimatedSalary), ordinal or count-like variables (Tenure, NumOfProducts), and binary indicator variables (HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, Gender_Male). The plots allow direct visual comparison of shape, concentration, and range across these variables in the train and test partitions.

Key insights:

  • CreditScore is broadly unimodal: In both train and test datasets, CreditScore is concentrated in the mid-range, with the highest density around roughly 600 to 700 and thinner tails toward the lower and upper ends of the observed range.

  • Balance shows a spike at zero: Balance exhibits a pronounced mass at 0 in both train and test datasets, alongside a separate concentration centered approximately in the 100k to 150k range, indicating a mixed distribution rather than a single continuous peak.

  • EstimatedSalary is relatively even: EstimatedSalary is spread across the full displayed range up to about 200k in both partitions, with bar heights that remain fairly consistent and no single dominant concentration.

  • NumOfProducts is concentrated at lower values: NumOfProducts takes discrete values from 1 to 4, with the largest count at 1, followed by 2, and substantially smaller counts at 3 and 4 in both train and test datasets.

  • Binary indicators are imbalanced to varying degrees: HasCrCard is more concentrated at 1 than 0 in both datasets, IsActiveMember is closer to balanced with a modestly higher count at 0, and the one-hot encoded geography indicators show more false than true observations for both Geography_Germany and Geography_Spain.

  • Train and test shapes are visually consistent: Across the displayed variables, the train and test histograms show closely aligned ranges and overall distributional forms, including the zero-heavy Balance pattern, the mid-range concentration of CreditScore, the near-uniform EstimatedSalary spread, and the discrete concentration of NumOfProducts.

Overall, the histogram results show that the numerical inputs span a mix of continuous, discrete, and binary distributions, with several variables displaying clear structural features such as zero inflation (Balance) and strong concentration in lower discrete categories (NumOfProducts). The train and test partitions appear visually similar in range and shape across the displayed features, while binary and one-hot encoded variables exhibit differing class balances across indicators. These results document the observed input distribution structure used in model development.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:e980
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:5561
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:ffb0
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:7a38
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:8f00
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:eb7d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:f458
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:1ac6
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:5d08
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:0574
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:6e47
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:e660
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:5738
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:46a4
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:0326
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:eb19
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:87df
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:404d
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:0c98
ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:eada
2026-05-26 22:21:51,307 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document
2026-05-26 22:22:22,144 - ERROR(validmind.api_client): Error logging test results to ValidMind API
2026-05-26 22:22:22,145 - ERROR(validmind.api_client): Error logging metadata to ValidMind API
Error running test validmind.data_validation.TabularNumericalHistograms:development_data: 
validmind.data_validation.MutualInformation:development_data

Mutual Information Development Data

The Mutual Information test evaluates the statistical dependency between each feature and the target to assess relative feature relevance. The results are presented as ranked bar charts for both train_dataset_final and test_dataset_final, with a minimum threshold of 0.01 shown as a horizontal reference line. In both datasets, mutual information scores are concentrated in a small subset of features, while several features are near or at zero. The ordering and magnitude of feature scores differ between the train and test samples, with NumOfProducts remaining the highest-scoring feature in both.

Key insights:

  • NumOfProducts is the strongest feature: NumOfProducts has the highest mutual information score in both datasets, at approximately 0.10 in the training data and 0.08 in the test data, clearly exceeding all other features.

  • Only a few features exceed threshold: In the training data, NumOfProducts, Balance, IsActiveMember, Geography_Germany, and Gender_Male are above the 0.01 threshold. In the test data, NumOfProducts, Geography_Germany, Balance, and CreditScore are above the threshold.

  • Feature ranking changes across samples: IsActiveMember is above threshold in the training data but appears at or near zero in the test data. Conversely, CreditScore is below threshold in training and slightly above threshold in test.

  • Several features show minimal information content: HasCrCard is at or near zero in both datasets. Geography_Spain is below threshold in both, and EstimatedSalary is below threshold in training and at or near zero in test.

  • Score distribution is highly concentrated: After the top one or two features, mutual information values drop sharply in both charts, indicating that the measured dependency with the target is unevenly distributed across the feature set.

The mutual information results show that predictive dependency is concentrated primarily in NumOfProducts, with a limited number of additional features contributing measurable information above the 0.01 threshold. The train and test charts share the same leading feature but differ in the set of secondary features above threshold, indicating variation in feature relevance across samples. Several variables exhibit consistently low or zero mutual information, while the overall score profile remains strongly skewed toward the top-ranked features.

Parameters:

{
  "min_threshold": 0.01
}
            

Figures

ValidMind Figure validmind.data_validation.MutualInformation:development_data:ea2e
ValidMind Figure validmind.data_validation.MutualInformation:development_data:d41e
2026-05-26 22:22:36,596 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
2026-05-26 22:22:36,599 - ERROR(validmind.api_client): Error logging figure to ValidMind API
2026-05-26 22:22:36,600 - ERROR(validmind.api_client): Error logging figure to ValidMind API
2026-05-26 22:22:36,600 - ERROR(validmind.api_client): Error logging figure to ValidMind API
2026-05-26 22:22:36,601 - ERROR(validmind.api_client): Error logging figure to ValidMind API
2026-05-26 22:22:36,602 - ERROR(validmind.api_client): Error logging figure to ValidMind API
2026-05-26 22:22:36,603 - ERROR(validmind.api_client): Error logging figure to ValidMind API
2026-05-26 22:22:36,603 - ERROR(validmind.api_client): Error logging figure to ValidMind API
2026-05-26 22:22:36,604 - ERROR(validmind.api_client): Error logging figure to ValidMind API
2026-05-26 22:22:36,605 - ERROR(validmind.api_client): Error logging figure to ValidMind API
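
For intuition about the Mutual Information scores discussed above, the sketch below computes a plug-in estimate of mutual information between two discrete sequences in plain Python. It is an illustration of the quantity only: the `mutual_information` helper and the toy data are hypothetical, and the ValidMind test may rely on a different estimator, particularly for continuous features such as Balance or CreditScore.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts.
        mi += (n_xy / n) * log(n_xy * n / (count_x[x] * count_y[y]))
    return mi


# Toy data (hypothetical): one feature determines the target, the other is unrelated.
target = [0, 0, 1, 1, 0, 0, 1, 1]
informative = [0, 0, 1, 1, 0, 0, 1, 1]
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]

min_threshold = 0.01
scores = {
    "informative": mutual_information(informative, target),
    "unrelated": mutual_information(unrelated, target),
}
for name, score in scores.items():
    flag = "above" if score > min_threshold else "at or below"
    print(f"{name:<12} {score:.4f}  ({flag} threshold)")
```

A feature that fully determines a balanced binary target scores ln 2 (about 0.693 nats), and an independent feature scores 0, so values near the 0.01 threshold indicate very weak dependency.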
validmind.data_validation.PearsonCorrelationMatrix:development_data

Pearson Correlation Matrix Development Data

The Pearson Correlation Matrix test evaluates linear dependency between variables by displaying pairwise Pearson correlation coefficients across the development dataset. The result is shown as correlation heat maps for both the train and test splits, with coefficients ranging from -1 to 1 and diagonal values of 1.0 by construction. Across both panels, most pairwise correlations are clustered near zero, with a limited number of moderate positive or negative relationships visible among specific feature pairs and between selected features and the target Exited.

Key insights:

  • Correlations are generally low: Most pairwise coefficients in both train and test are close to zero, indicating limited linear dependence across the majority of variables shown in the matrix.

  • Strongest feature correlation is moderate: The largest off-diagonal positive correlation is between Balance and Geography_Germany at 0.42 in both train and test. This is the highest observed positive association in the matrix and remains below the 0.7 high-correlation threshold described in the test methodology.

  • Geography indicators are negatively related: Geography_Germany and Geography_Spain show a correlation of -0.36 in both train and test. This is the strongest negative relationship visible in the matrix.

  • Balance and number of products are inversely related: Balance and NumOfProducts show correlations of -0.18 in the train split and -0.17 in the test split, indicating a small but consistent negative linear relationship.

  • Target relationships are weak to modest: The target Exited has its largest positive correlation with Geography_Germany at 0.21 in train and 0.22 in test, and its largest negative correlation with IsActiveMember at -0.18 in train and -0.16 in test. Balance also shows a positive correlation with Exited of 0.17 in train and 0.14 in test, while Gender_Male shows a negative correlation of -0.14 in train and -0.15 in test.

  • Correlation structure is stable across splits: The main observed relationships are highly consistent between train and test, including Balance with Geography_Germany (0.42 in both), Geography_Germany with Geography_Spain (-0.36 in both), and the direction and approximate magnitude of correlations involving Exited.

The correlation analysis shows a predominantly low-correlation feature set, with only a small number of moderate linear relationships present. The strongest observed associations are Balance with Geography_Germany and the negative relationship between the two geography indicators, while correlations with Exited remain weak to modest in magnitude. The same overall pattern is reproduced across train and test splits, indicating a stable correlation structure within the development data.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:032d
ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:8dda
2026-05-26 22:22:56,996 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document
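
The negative correlation between Geography_Germany and Geography_Spain is what one would expect from one-hot encoded indicators of the same variable: a row can be flagged for at most one of them. The sketch below shows this with a hand-written Pearson coefficient on toy data. The `pearson` helper and the ten-row example are hypothetical and do not reproduce the development data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy one-hot indicators (hypothetical): 3 rows in Germany, 2 in Spain, 5 elsewhere.
germany = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
spain = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

r = pearson(germany, spain)

# For mutually exclusive indicators with shares p and q, the coefficient has the
# closed form -sqrt(p * q / ((1 - p) * (1 - q))).
p, q = sum(germany) / len(germany), sum(spain) / len(spain)
closed_form = -sqrt(p * q / ((1 - p) * (1 - q)))

print(f"pearson = {r:.4f}, closed form = {closed_form:.4f}")
```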
validmind.data_validation.HighPearsonCorrelation:development_data

❌ High Pearson Correlation Development Data

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify highly correlated variable pairs that may indicate redundancy or multicollinearity. Results are reported for both train_dataset_final and test_dataset_final using a maximum threshold of 0.3 and listing the top 10 correlations per dataset. Across both datasets, the output includes the feature pair, Pearson correlation coefficient, and Pass/Fail status, with two feature pairs exceeding the threshold in absolute value in each dataset. The remaining reported correlations are below the threshold and are marked as passing.

Key insights:

  • Two correlations exceed threshold: In both train_dataset_final and test_dataset_final, 2 of the 10 reported feature pairs fail the 0.3 threshold. These are (Balance, Geography_Germany) with coefficients of 0.4188 and 0.4231, and (Geography_Germany, Geography_Spain) with coefficients of -0.3614 and -0.3631.

  • Top correlations are consistent across datasets: The same two feature pairs are the only failing correlations in both datasets, and their magnitudes are closely aligned between train and test. This indicates a stable correlation pattern for the strongest observed relationships.

  • Remaining relationships are modest: All other listed correlations fall within an absolute range of 0.0603 to 0.2215 across the two datasets and are marked as passing. Examples include (Geography_Germany, Exited) at 0.2073 in train and 0.2215 in test, and (IsActiveMember, Exited) at -0.1775 in train and -0.1645 in test.

  • Direction of association is stable: The sign of the reported correlations is consistent between train and test for repeated feature pairs. Positive relationships such as (Balance, Geography_Germany) and negative relationships such as (Geography_Germany, Geography_Spain) retain the same direction across both datasets.

The test results show a limited set of pairwise linear relationships exceeding the configured 0.3 threshold, with the same two failing feature pairs appearing in both development datasets. The strongest observed relationship is the positive correlation between Balance and Geography_Germany, followed by the negative correlation between Geography_Germany and Geography_Spain. Beyond these pairs, the reported feature relationships are comparatively weak and remain below the threshold, with similar magnitudes and directions across train and test.

Parameters:

{
  "max_threshold": 0.3,
  "top_n_correlations": 10
}
            

Tables

dataset | Columns | Coefficient | Pass/Fail
train_dataset_final | (Balance, Geography_Germany) | 0.4188 | Fail
train_dataset_final | (Geography_Germany, Geography_Spain) | -0.3614 | Fail
train_dataset_final | (Geography_Germany, Exited) | 0.2073 | Pass
train_dataset_final | (Balance, NumOfProducts) | -0.1789 | Pass
train_dataset_final | (IsActiveMember, Exited) | -0.1775 | Pass
train_dataset_final | (Balance, Geography_Spain) | -0.1768 | Pass
train_dataset_final | (Balance, Exited) | 0.1695 | Pass
train_dataset_final | (Gender_Male, Exited) | -0.1440 | Pass
train_dataset_final | (NumOfProducts, Exited) | -0.0658 | Pass
train_dataset_final | (CreditScore, Exited) | -0.0603 | Pass
test_dataset_final | (Balance, Geography_Germany) | 0.4231 | Fail
test_dataset_final | (Geography_Germany, Geography_Spain) | -0.3631 | Fail
test_dataset_final | (Geography_Germany, Exited) | 0.2215 | Pass
test_dataset_final | (Balance, NumOfProducts) | -0.1661 | Pass
test_dataset_final | (IsActiveMember, Exited) | -0.1645 | Pass
test_dataset_final | (Gender_Male, Exited) | -0.1467 | Pass
test_dataset_final | (Balance, Exited) | 0.1421 | Pass
test_dataset_final | (HasCrCard, Geography_Spain) | -0.1312 | Pass
test_dataset_final | (Balance, Geography_Spain) | -0.1095 | Pass
test_dataset_final | (Geography_Spain, Gender_Male) | 0.0916 | Pass
2026-05-26 22:23:08,694 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
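
The ranking and Pass/Fail flags in the table above are consistent with sorting pairs by absolute coefficient and comparing that absolute value with `max_threshold`. The sketch below applies this rule to a subset of the reported train coefficients. The `flag_high_correlations` helper is hypothetical and only mirrors the behaviour visible in the output, not the library's internals.

```python
max_threshold = 0.3
top_n_correlations = 10

# A subset of the train_dataset_final coefficients, copied from the table above.
train_coefficients = {
    ("Geography_Germany", "Exited"): 0.2073,
    ("Balance", "Geography_Germany"): 0.4188,
    ("IsActiveMember", "Exited"): -0.1775,
    ("Geography_Germany", "Geography_Spain"): -0.3614,
    ("Balance", "NumOfProducts"): -0.1789,
}


def flag_high_correlations(coefficients, max_threshold, top_n):
    """Rank pairs by absolute coefficient and mark those above the threshold as failing."""
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return [
        (pair, coef, "Fail" if abs(coef) > max_threshold else "Pass")
        for pair, coef in ranked[:top_n]
    ]


results = flag_high_correlations(train_coefficients, max_threshold, top_n_correlations)
for pair, coef, status in results:
    print(f"{str(pair):<45} {coef:+.4f}  {status}")
```

Because the comparison uses the absolute value, the negative coefficient of -0.3614 fails the 0.3 threshold just as the positive 0.4188 does.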
validmind.model_validation.ModelMetadata

Model Metadata

The ModelMetadata test evaluates the metadata of different models by comparing high-level implementation attributes across models. The results are presented as a summary table covering modeling technique, modeling framework, framework version, and programming language for each model. In this test run, the table includes two models—log_model_champion and rf_model—with their metadata shown side by side for direct comparison.

Key insights:

  • Metadata is fully aligned across models: Both log_model_champion and rf_model are recorded with the same modeling technique (SKlearnModel), modeling framework (sklearn), framework version (1.8.0), and programming language (Python).
  • No cross-model version differences observed: The framework version is identical at 1.8.0 for both models, indicating no version variation in the metadata captured by this test.
  • Implementation stack is consistent: The shared use of the sklearn framework and Python language indicates that both models are documented under the same implementation environment.

The metadata comparison shows a fully consistent set of recorded implementation attributes across the two models included in the test. No differences are observed in modeling technique, framework, framework version, or programming language within the reported results. This indicates that the metadata captured for these models is uniform across the fields assessed by the test.

Tables

model | Modeling Technique | Modeling Framework | Framework Version | Programming Language
log_model_champion | SKlearnModel | sklearn | 1.8.0 | Python
rf_model | SKlearnModel | sklearn | 1.8.0 | Python
2026-05-26 22:23:14,962 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document
validmind.model_validation.sklearn.ModelParameters

Model Parameters

The Model Parameters test extracts and displays model configuration values to document the parameter settings used by each estimator. The results are presented as a structured table with model name, parameter name, and parameter value for two models: log_model_champion and rf_model. For log_model_champion, the table lists logistic regression settings including regularization, solver, and iteration controls; for rf_model, it lists random forest settings including ensemble size, split controls, feature sampling, and reproducibility-related parameters.

Key insights:

  • Two model configurations documented: The results capture parameter settings for log_model_champion and rf_model, providing a side-by-side record of the two estimator configurations included in the test output.
  • Logistic regression uses L1 regularization: log_model_champion is configured with penalty = l1, solver = liblinear, C = 1, and max_iter = 100, with fit_intercept = True and tol = 0.0001 also recorded.
  • Random forest uses 50 trees: rf_model is configured with n_estimators = 50, criterion = gini, max_features = sqrt, and bootstrap = True, defining the ensemble construction and split evaluation settings shown in the table.
  • Tree growth controls are minimally constrained: For rf_model, min_samples_leaf = 1, min_samples_split = 2, min_impurity_decrease = 0.0, and ccp_alpha = 0.0 are recorded, indicating the split and pruning parameters present in the extracted configuration.
  • Reproducibility parameters are explicit for random forest: rf_model includes random_state = 42, while both models show verbose = 0 and warm_start = False; rf_model also records oob_score = False.

The extracted parameter table provides a direct record of the configured settings for both the logistic regression and random forest models. The results show that log_model_champion is defined by an L1-regularized liblinear specification, while rf_model is defined by a 50-tree ensemble with gini splitting, sqrt feature sampling, and an explicit random seed. Collectively, the output documents the core training configuration fields available through the models’ parameter interfaces.

Tables

model | Parameter | Value
log_model_champion | C | 1
log_model_champion | dual | False
log_model_champion | fit_intercept | True
log_model_champion | intercept_scaling | 1
log_model_champion | max_iter | 100
log_model_champion | penalty | l1
log_model_champion | solver | liblinear
log_model_champion | tol | 0.0001
log_model_champion | verbose | 0
log_model_champion | warm_start | False
rf_model | bootstrap | True
rf_model | ccp_alpha | 0.0
rf_model | criterion | gini
rf_model | max_features | sqrt
rf_model | min_impurity_decrease | 0.0
rf_model | min_samples_leaf | 1
rf_model | min_samples_split | 2
rf_model | min_weight_fraction_leaf | 0.0
rf_model | n_estimators | 50
rf_model | oob_score | False
rf_model | random_state | 42
rf_model | verbose | 0
rf_model | warm_start | False
2026-05-26 22:23:22,966 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
validmind.model_validation.sklearn.ROCCurve

ROC Curve

The ROCCurve test evaluates binary classification performance by plotting the Receiver Operating Characteristic curve and calculating the Area Under the Curve (AUC). The results show ROC curves for log_model_champion on both train_dataset_final and test_dataset_final, each compared against a random-classification reference line with AUC of 0.5. The reported AUC is 0.69 on the training dataset and 0.68 on the test dataset, and in both plots the model ROC curve remains above the random baseline across the threshold range.

Key insights:

  • Similar train and test AUCs: The AUC values are 0.69 on the training dataset and 0.68 on the test dataset. This indicates closely aligned discrimination performance across the two samples.
  • Performance above random baseline: In both figures, the ROC curve lies above the diagonal random reference line. The reported AUC values are also above 0.5 in both datasets.
  • Consistent ROC shape across datasets: The train and test ROC curves show similar progression, with gradual improvement in true positive rate as false positive rate increases. No material visual divergence is evident between the two plots.

The ROC results indicate that the model distinguishes between the two classes better than random classification on both the training and test datasets. Performance is similar across the two samples, with only a 0.01 difference in the rounded AUC values. Taken together, the plots show stable discrimination characteristics between training and test evaluation.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:d2f9
ValidMind Figure validmind.model_validation.sklearn.ROCCurve:e65c
2026-05-26 22:23:35,762 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
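
As a reminder of what the AUC values above measure, the sketch below computes ROC AUC as the share of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. This pairwise form is equivalent to the area under the ROC curve. The `roc_auc` helper and the toy scores are hypothetical and unrelated to log_model_champion.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    # Each pair contributes 1 if the positive scores higher, 0.5 on a tie, else 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy labels and scores (hypothetical).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)

# A model that assigns every row the same score cannot rank at all.
baseline = roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])

print(f"toy AUC = {auc:.2f}, constant-score AUC = {baseline:.2f}")
```

The constant-score case lands on 0.5, the random-classification reference line shown in the plots.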
validmind.model_validation.sklearn.MinimumROCAUCScore

✅ Minimum ROCAUC Score

The Minimum ROC AUC Score test evaluates whether the model’s ROC AUC score meets or exceeds the specified minimum threshold on each evaluated dataset. The result table reports ROC AUC scores, the applied threshold, and pass/fail outcomes for both train_dataset_final and test_dataset_final. The observed scores are 0.6872 for the training dataset and 0.6812 for the test dataset, with a common threshold of 0.5, and both datasets are recorded as passing.

Key insights:

  • Both datasets passed the threshold: train_dataset_final and test_dataset_final both achieved ROC AUC scores above the minimum threshold of 0.5, resulting in a pass outcome for each dataset.
  • Scores are closely aligned across datasets: The ROC AUC score is 0.6872 on train_dataset_final and 0.6812 on test_dataset_final, a difference of 0.0060 between the two evaluated samples.
  • Test performance exceeds the minimum benchmark: The test dataset score of 0.6812 remains above the configured minimum threshold of 0.5 by 0.1812, while the training dataset exceeds the threshold by 0.1872.

The test results show that the model satisfied the minimum ROC AUC criterion on both the training and test datasets. Performance levels are similar across the two evaluated datasets, with only a small difference in observed scores. Within the scope of this test, the measured class discrimination exceeded the configured minimum benchmark in both cases.

Parameters:

{
  "min_threshold": 0.5
}
            

Tables

Dataset              Score   Threshold  Pass/Fail
train_dataset_final  0.6872  0.5        Pass
test_dataset_final   0.6812  0.5        Pass
2026-05-26 22:23:44,795 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
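
The pass/fail logic and the margins quoted in the key insights can be reproduced from the reported scores alone. The following is a minimal sketch, assuming the check is a simple "meets or exceeds" comparison as the test description states; the helper `check_minimum_score` is a hypothetical name and not part of the ValidMind Library.

```python
MIN_THRESHOLD = 0.5  # the min_threshold parameter shown in the test output

# ROC AUC scores copied from the result table above
reported_scores = {
    "train_dataset_final": 0.6872,
    "test_dataset_final": 0.6812,
}


def check_minimum_score(scores, min_threshold):
    """Build one result row per dataset: score, threshold, margin, and pass flag."""
    rows = []
    for dataset, score in scores.items():
        rows.append(
            {
                "dataset": dataset,
                "score": score,
                "threshold": min_threshold,
                # Margin above (or below, if negative) the minimum threshold
                "margin": round(score - min_threshold, 4),
                # "Meets or exceeds" is assumed to mean >=
                "passed": score >= min_threshold,
            }
        )
    return rows


rows = check_minimum_score(reported_scores, MIN_THRESHOLD)
for row in rows:
    print(row)

# Gap between training and test scores, as quoted in the key insights
train_test_gap = round(
    reported_scores["train_dataset_final"] - reported_scores["test_dataset_final"], 4
)
print(f"Train/test gap: {train_test_gap:.4f}")
```

Rerunning the helper with a stricter, organization-specific threshold (for example, a hypothetical 0.7) shows how sensitive the pass outcome is to the configured minimum.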

In summary

In this final notebook, you learned how to:

With our ValidMind for validation series of notebooks, you learned how to validate a record (model) end-to-end with the ValidMind Library by running through some common scenarios in a typical validation setting:

  • Verifying the data quality steps performed by the development team
  • Independently replicating the champion's results and conducting additional tests to assess performance, stability, and robustness
  • Setting up test inputs and a challenger for comparative analysis
  • Running validation tests, analyzing results, and logging artifacts to ValidMind

Next steps

Work with your validation report

Now that you've logged all your test results and verified the work done by the development team, head to the ValidMind Platform to wrap up your validation report. Continue to work on your validation report by:

  • Inserting additional test results: Click Link Evidence under any Evidence panel of 2. Validation in your validation report. (Learn more: Link evidence to reports)

  • Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)

  • Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage artifacts)

  • Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit the ValidMind-generated draft to adhere to your organization's requirements. (Learn more: Work with content blocks)

  • Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Assign compliance assessments)

  • Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaborative features to work seamlessly with the rest of your organization, including developers. Propose suggested changes in the documentation, work with versioned history, and use comments to discuss specific portions of the documentation. (Learn more: Collaborate with others)

When your validation report is complete and ready for review, submit it for approval from the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough validation history. (Learn more: Submit documents)

Learn more

Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining validation:

Use cases

Discover more learning resources

Learn more about the ValidMind Library tools we used in this notebook:

We also offer many interactive notebooks to help you use the ValidMind Library to streamline your work:

Or, visit our documentation to learn more about ValidMind.


Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial