ValidMind for model validation 4 — Finalize testing and reporting
Learn how to use ValidMind for your end-to-end model validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.
This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs:
The function can be as simple or as complex as you need it to be — it can use external libraries, make API calls, or do anything else that you can do in Python.
The only requirement is that the function signature and return values can be "understood" and handled by the ValidMind Library. As such, custom tests offer added flexibility by extending the default tests provided by ValidMind, enabling you to document any type of model or use case.
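As a minimal sketch of what such a function can look like, here is a plain Python function that takes a dataset-like input plus a parameter and returns a table. The function name `class_imbalance_check` and its `max_gap` parameter are illustrative, not part of the ValidMind API:

```python
import pandas as pd

def class_imbalance_check(df, target_column="Exited", max_gap=0.2):
    """Flag the target as imbalanced if class shares differ by more than max_gap."""
    shares = df[target_column].value_counts(normalize=True)
    gap = float(shares.max() - shares.min())
    return pd.DataFrame(
        {"Gap": [round(gap, 3)], "Pass/Fail": ["Pass" if gap <= max_gap else "Fail"]}
    )

# Toy frame: 7 stayed, 3 exited -> gap of 0.4 -> Fail
toy = pd.DataFrame({"Exited": [0] * 7 + [1] * 3})
print(class_imbalance_check(toy))
```

Registering a function like this with the `@vm.test` decorator, covered later in this notebook, is what makes it runnable via `run_test()`.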
For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.
Learn by doing
Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform — Validator Fundamentals.
Prerequisites
In order to finalize validation and reporting, you'll need to first have:
Need help with the above steps?
Refer to the first three notebooks in this series:
# Make sure the ValidMind Library is installed
%pip install -q validmind

# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env

# Or replace with your code snippet
import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="validation-report",
)
Note: you may need to restart the kernel to use updated packages.
2026-04-03 03:08:12,431 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report
Import the sample dataset
Next, we'll load the same sample Bank Customer Churn Prediction dataset that was used to develop the champion model, which we will independently preprocess:
# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}")

raw_df = demo_dataset.load_data()
Loaded demo dataset with:
• Target column: 'Exited'
• Class labels: {'0': 'Did not exit', '1': 'Exited'}
# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)
import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
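On a toy frame, the same downsampling recipe can be checked to confirm it produces equal class counts. The frame below is synthetic and for illustration only:

```python
import pandas as pd

# Synthetic imbalanced frame: 30 exited, 70 stayed
toy_raw = pd.DataFrame({"Exited": [1] * 30 + [0] * 70})

exited = toy_raw.loc[toy_raw["Exited"] == 1]
not_exited = toy_raw.loc[toy_raw["Exited"] == 0].sample(n=exited.shape[0], random_state=42)

balanced = pd.concat([exited, not_exited]).sample(frac=1, random_state=42)

# Both classes now have the same number of rows (30 each)
print(balanced["Exited"].value_counts().to_dict())
```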
Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:
# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)
# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)
❌ High Pearson Correlation
The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten strongest pairwise Pearson correlation coefficients, along with their corresponding Pass or Fail status based on a threshold of 0.3. Each row details the feature pair, the computed correlation coefficient, and whether the absolute value of the coefficient exceeds the threshold.
Key insights:
One feature pair exceeds correlation threshold: The pair (Age, Exited) shows a correlation coefficient of 0.345, surpassing the 0.3 threshold and resulting in a Fail status.
All other feature pairs below threshold: The remaining nine feature pairs have absolute correlation coefficients ranging from 0.041 to 0.188, all classified as Pass.
Predominantly weak linear relationships: Most feature pairs exhibit weak linear associations, with coefficients clustered well below the threshold.
The results indicate that, with the exception of the (Age, Exited) pair, feature pairs in the dataset do not display strong linear relationships. The overall correlation structure suggests limited risk of feature redundancy or multicollinearity, with only one pair warranting further attention due to its moderate correlation.
Parameters:
{
"max_threshold": 0.3
}
Tables
Columns                          Coefficient  Pass/Fail
(Age, Exited)                         0.3450  Fail
(IsActiveMember, Exited)             -0.1880  Pass
(Balance, NumOfProducts)             -0.1642  Pass
(Balance, Exited)                     0.1516  Pass
(NumOfProducts, IsActiveMember)       0.0593  Pass
(NumOfProducts, Exited)              -0.0558  Pass
(Tenure, IsActiveMember)             -0.0550  Pass
(HasCrCard, IsActiveMember)          -0.0481  Pass
(Age, Balance)                        0.0433  Pass
(Age, NumOfProducts)                 -0.0410  Pass
# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df
   Columns                          Coefficient  Pass/Fail
0  (Age, Exited)                         0.3450  Fail
1  (IsActiveMember, Exited)             -0.1880  Pass
2  (Balance, NumOfProducts)             -0.1642  Pass
3  (Balance, Exited)                     0.1516  Pass
4  (NumOfProducts, IsActiveMember)       0.0593  Pass
5  (NumOfProducts, Exited)              -0.0558  Pass
6  (Tenure, IsActiveMember)             -0.0550  Pass
7  (HasCrCard, IsActiveMember)          -0.0481  Pass
8  (Age, Balance)                        0.0433  Pass
9  (Age, NumOfProducts)                 -0.0410  Pass
# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features
['(Age, Exited)']
# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features
['Age']
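The list comprehension above deliberately keeps only the first feature of each pair, since the second element here is the target column itself. If you ever need both sides of a pair string like "(Age, Exited)", a variant of the same parsing (illustrative only) is:

```python
pairs = ["(Age, Exited)"]  # example pair string, as returned in the test table

# Split on the comma and strip parentheses and whitespace from each side
features = [name.strip(" ()") for pair in pairs for name in pair.split(",")]
print(features)  # → ['Age', 'Exited']
```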
# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)
# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)
✅ High Pearson Correlation
The High Pearson Correlation test evaluates the linear relationships between feature pairs to identify potential redundancy or multicollinearity within the dataset. The results table presents the top ten absolute Pearson correlation coefficients, along with the corresponding feature pairs and Pass/Fail status based on a threshold of 0.3. All reported coefficients are below the threshold, and each feature pair is marked as Pass.
Key insights:
No feature pairs exceed correlation threshold: All absolute Pearson correlation coefficients are below the 0.3 threshold, with the highest observed value being 0.188 between IsActiveMember and Exited.
Low to moderate linear relationships: The strongest correlations observed are moderate in magnitude, with most coefficients clustered below 0.2, indicating limited linear association among the top feature pairs.
Consistent Pass status across all pairs: Every evaluated feature pair is marked as Pass, reflecting the absence of high linear dependencies within the top correlations.
The results indicate that the dataset does not exhibit high linear correlations among the evaluated feature pairs, suggesting minimal risk of feature redundancy or multicollinearity based on the Pearson correlation metric. The observed correlation structure supports the interpretability and stability of subsequent modeling efforts.
Parameters:
{
"max_threshold": 0.3
}
Tables
Columns                          Coefficient  Pass/Fail
(IsActiveMember, Exited)             -0.1880  Pass
(Balance, NumOfProducts)             -0.1642  Pass
(Balance, Exited)                     0.1516  Pass
(NumOfProducts, IsActiveMember)       0.0593  Pass
(NumOfProducts, Exited)              -0.0558  Pass
(Tenure, IsActiveMember)             -0.0550  Pass
(HasCrCard, IsActiveMember)          -0.0481  Pass
(CreditScore, Exited)                -0.0307  Pass
(Balance, IsActiveMember)            -0.0287  Pass
(CreditScore, EstimatedSalary)       -0.0246  Pass
Split the preprocessed dataset
With our raw dataset rebalanced and the highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:
# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()
      CreditScore  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited  Geography_Germany  Geography_Spain  Gender_Male
4177          807       1  141069.18              3          1               1        194257.11       0               True            False        False
2877          588      10  129417.82              1          1               0        153727.32       0               True            False        False
6020          611      10  103294.56              1          1               0        160548.12       0               True            False        False
4764          668      10  110240.04              1          0               0        183980.56       1              False            False        False
4071          542       8  105770.14              1          0               1        140929.98       1              False             True         True
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
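Note that the split above is random and unstratified. Because the dataset was balanced by hand, that is usually acceptable here, but passing `stratify` to `train_test_split` preserves the class proportions in both splits. A sketch on a synthetic frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic 50/50 frame for illustration
toy = pd.DataFrame({"x": range(100), "Exited": [0, 1] * 50})

# stratify keeps the 50/50 class ratio in both train and test
train, test = train_test_split(toy, test_size=0.20, stratify=toy["Exited"], random_state=42)
print(test["Exited"].value_counts().to_dict())  # 10 of each class in the 20-row test split
```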
With our raw dataset assessed and preprocessed, let's go ahead and import the champion model submitted by the model development team as a pickle file: lr_model_champion.pkl
# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:463: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.8.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
Train potential challenger model
We'll also train our random forest classification challenger model to see how it compares:
# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)
In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data for each of our two models:
# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)
# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)
2026-04-03 03:08:20,774 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-03 03:08:20,776 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-03 03:08:20,776 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-03 03:08:20,778 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-03 03:08:20,781 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-03 03:08:20,782 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-03 03:08:20,782 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-03 03:08:20,783 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-03 03:08:20,786 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-03 03:08:20,807 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-03 03:08:20,808 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-03 03:08:20,829 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-04-03 03:08:20,831 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-04-03 03:08:20,842 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-04-03 03:08:20,843 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-04-03 03:08:20,854 - INFO(validmind.vm_models.dataset.utils): Done running predict()
Implementing custom tests
Thanks to the model documentation (Learn more ...), we know that the model development team implemented a custom test to further evaluate the performance of the champion model.
In a typical model validation situation, you would load a saved custom test provided by the model development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the process.
Let's implement the same custom inline test that calculates the confusion matrix for a binary classification model that the model development team used in their performance evaluations.
An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.
Create a confusion matrix plot
Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:
import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()
Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:
The function confusion_matrix takes two arguments, dataset and model. These are a VMDataset and a VMModel object, respectively.
VMDataset objects allow you to access the dataset's true (target) values by accessing the .y attribute.
VMDataset objects allow you to access the predictions for a given model by accessing the .y_pred() method.
The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function as we just did above.
The function then returns the ConfusionMatrixDisplay.figure_ object — this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.
The @vm.test decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself
You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:
The Confusion Matrix test evaluates the classification performance of the model by comparing predicted labels to true labels for both the training and test datasets. The resulting matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model prediction accuracy and error distribution. The results are presented separately for the train and test datasets, allowing for assessment of model consistency and generalization.
Key insights:
Balanced classification performance on training data: The training dataset confusion matrix shows 834 true negatives, 792 true positives, 458 false positives, and 501 false negatives, indicating the model captures both classes with a moderate level of misclassification.
Consistent error distribution on test data: The test dataset confusion matrix reports 214 true negatives, 217 true positives, 110 false positives, and 106 false negatives, reflecting a similar pattern of correct and incorrect predictions as observed in training.
Comparable rates of false positives and false negatives: Both datasets exhibit similar magnitudes of false positives and false negatives, suggesting the model does not disproportionately favor one class over the other in its misclassifications.
Generalization from train to test: The relative proportions of each confusion matrix cell remain stable between training and test datasets, indicating consistent model behavior and absence of significant overfitting.
The confusion matrix results demonstrate that the model maintains balanced predictive performance across both training and test datasets, with similar rates of correct and incorrect classifications for each class. The observed stability in error distribution between datasets indicates consistent generalization and no evidence of class imbalance in prediction errors.
Figures
2026-04-03 03:08:25,843 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document
The Confusion Matrix:challenger test evaluates the classification performance of the model by comparing predicted and true labels for both the training and test datasets. The resulting confusion matrices display the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model accuracy and error distribution. The first matrix corresponds to the training dataset, while the second matrix summarizes results on the test dataset.
Key insights:
Perfect classification on training data: The training confusion matrix shows 1,292 true negatives and 1,292 true positives, with zero false positives and only one false negative, indicating near-perfect separation of classes during training.
Noticeable error rates on test data: The test confusion matrix records 217 true negatives, 229 true positives, 107 false positives, and 94 false negatives, reflecting a reduction in classification accuracy compared to training.
Balanced class representation: Both matrices display similar counts for positive and negative classes, suggesting balanced class distributions in both datasets.
The confusion matrix results indicate that the model achieves near-perfect classification on the training data, with minimal misclassification. However, performance on the test data shows a higher rate of both false positives and false negatives, suggesting a decrease in generalization capability outside the training sample. The balanced class counts across datasets support the reliability of these observations.
Figures
2026-04-03 03:08:30,671 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document
Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.
That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.
Add parameters to custom tests
Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:
@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself
Pass parameters to custom tests
You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.
The parameters will override any default parameters set in the custom test definition. Note that dataset and model are still passed as inputs.
Since these are VMDataset or VMModel inputs, they have a special meaning.
Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:
# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model],
    },
    params={"normalize": True},
).log()
Confusion Matrix Test Normalized Champion
The ConfusionMatrix:test_normalized_champion test evaluates the classification performance of the model by displaying the normalized proportions of true positives, true negatives, false positives, and false negatives. The resulting matrix provides a visual summary of prediction accuracy and error distribution for the test dataset, with each cell representing the fraction of total predictions for each outcome. The matrix is normalized such that the sum of all cells equals 1, facilitating direct comparison of error and correct classification rates.
Key insights:
Balanced correct classification rates: The model correctly classifies 0.33 of all samples as true negatives and 0.34 as true positives, indicating similar accuracy for both classes.
Comparable error rates for both classes: False positive and false negative rates are 0.17 and 0.16, respectively, showing that misclassification is distributed relatively evenly between the two error types.
No class dominance in prediction errors: The normalized confusion matrix does not indicate a substantial bias toward either false positives or false negatives.
The normalized confusion matrix demonstrates that the model achieves similar performance across both classes, with correct and incorrect predictions distributed evenly. The absence of pronounced class imbalance in error rates suggests consistent classification behavior, with no single error type disproportionately affecting model outcomes.
Parameters:
{
"normalize": true
}
Figures
2026-04-03 03:08:34,837 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document
# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model],
    },
    params={"normalize": True},
).log()
Confusion Matrix Test Normalized Challenger
The ConfusionMatrix:test_normalized_challenger test evaluates the classification performance of the model by presenting the normalized confusion matrix for the test dataset. The matrix displays the proportion of true positives, true negatives, false positives, and false negatives, allowing for assessment of the model's predictive accuracy and error distribution. Each cell in the matrix represents the fraction of predictions for each true and predicted label combination, normalized over the total number of samples.
Key insights:
Balanced true positive and true negative rates: The model correctly predicts the negative class (True Negative) for 34% and the positive class (True Positive) for 35% of all cases, indicating similar accuracy for both classes.
Moderate false positive and false negative rates: False positives account for 17% and false negatives for 15% of predictions, reflecting a moderate level of misclassification in both directions.
No class dominance in prediction errors: The distribution of errors between false positives and false negatives is relatively even, suggesting the model does not disproportionately misclassify one class over the other.
The normalized confusion matrix indicates that the model achieves comparable accuracy for both positive and negative classes, with error rates distributed relatively evenly between false positives and false negatives. This balanced performance suggests the model does not exhibit a strong bias toward either class, and misclassification rates are moderate across both outcome types.
Parameters:
{
"normalize": true
}
Figures
2026-04-03 03:08:39,188 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document
Use external test providers
Sometimes you may want to reuse the same set of custom tests across multiple models and share them with others in your organization, as the model development team would have done with you in the workflow featured in this series of notebooks. In this case, you can create an external custom test provider that loads custom tests from a local folder or a Git repository.
In this section, you'll learn how to declare a local filesystem test provider that loads tests from a local folder, following these high-level steps:
Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.
The following code snippet will create a new my_tests directory in the current working directory if it doesn't exist:
tests_folder = "my_tests"

import os

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    # remove files and pycache
    if f.endswith(".py") or f == "__pycache__":
        os.system(f"rm -rf {tests_folder}/{f}")
After running the command above, confirm that a new my_tests directory was created successfully. For example:
~/notebooks/tutorials/model_validation/my_tests/
Save an inline test
The @vm.test decorator we used above to register one-off custom tests also includes a convenience method on the function object that allows you to simply call <func_name>.save() to save the test to a Python file at a specified path.
While save() will get you started by creating the file and saving the function code under the correct name, it won't automatically include any imports, helper functions, or variables defined outside the function that the test needs to run. To solve this, pass an optional imports argument to ensure the necessary imports are added to the file.
The confusion_matrix test requires the following additional imports:
import matplotlib.pyplot as plt
from sklearn import metrics
Let's pass these imports to the save() method to ensure they are included in the file with the following command:
confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)
2026-04-03 03:08:39,582 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/model_validation/my_tests/ConfusionMatrix.py! Be sure to add any necessary imports to the top of the file.
2026-04-03 03:08:39,582 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix
# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix
Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:
ValidMind offers out-of-the-box test providers for local tests (tests in a folder) or a GitHub provider for tests in a GitHub repository.
You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.
For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.
The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in model documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.
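To make the load_test contract concrete, here is a minimal sketch of a hypothetical provider class (not the ValidMind implementation) whose load_test resolves a test name to a same-named function in a same-named .py file:

```python
import importlib.util
import os
import tempfile

class FolderTestProvider:
    """Hypothetical minimal provider: 'Name' -> <folder>/Name.py, function 'Name'."""

    def __init__(self, folder):
        self.folder = folder

    def load_test(self, test_name):
        # Import <folder>/<test_name>.py and return its <test_name> function
        path = os.path.join(self.folder, f"{test_name}.py")
        spec = importlib.util.spec_from_file_location(test_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return getattr(module, test_name)

# Demo: write a trivial test file into a temp folder and load it back
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "AlwaysPass.py"), "w") as f:
    f.write("def AlwaysPass():\n    return 'Pass'\n")

provider = FolderTestProvider(folder)
print(provider.load_test("AlwaysPass")())  # → Pass
```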
Let's go ahead and load the custom tests from our my_tests directory:
from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in `my_tests/ConfusionMatrix.py` file
Run test provider tests
Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:
For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.
# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model],
    },
).log()
Confusion Matrix Champion
The Confusion Matrix test evaluates the classification performance of the model by comparing predicted labels against true labels on the test dataset. The resulting matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of model prediction accuracy and error types. The matrix for the champion model on the test dataset shows the distribution of correct and incorrect predictions across both classes.
Key insights:
Balanced correct classification across classes: The model correctly classified 214 negative cases (true negatives) and 217 positive cases (true positives), indicating similar accuracy for both classes.
Comparable error rates for both classes: The number of false positives (110) and false negatives (106) are closely matched, suggesting that the model's misclassification rates are similar for both positive and negative classes.
Substantial proportion of correct predictions: The sum of true positives and true negatives (431) constitutes a majority of the total predictions, reflecting a substantial correct classification rate.
The confusion matrix indicates that the model demonstrates balanced performance in distinguishing between positive and negative classes, with similar rates of correct and incorrect predictions for each class. The distribution of errors does not show a pronounced bias toward either class, supporting the overall consistency of the model's classification behavior on the test dataset.
Figures
2026-04-03 03:08:43,191 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document
# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model],
    },
).log()
Confusion Matrix Challenger
The Confusion Matrix:challenger test evaluates the classification performance of the model by comparing predicted labels to true labels on the test dataset. The resulting confusion matrix displays the counts of true positives, true negatives, false positives, and false negatives, providing a comprehensive view of the model's prediction accuracy and error distribution. The matrix is structured with true labels on the vertical axis and predicted labels on the horizontal axis, with each cell indicating the count of observations for each outcome.
Key insights:
Higher true positive and true negative counts: The model correctly classified 229 true positives and 217 true negatives, indicating strong identification of both classes.
Moderate false positive and false negative rates: There are 107 false positives and 94 false negatives, reflecting a moderate level of misclassification for both classes.
Balanced error distribution: The counts of false positives and false negatives are similar in magnitude, suggesting that misclassification is not heavily skewed toward one class.
The confusion matrix reveals that the model demonstrates effective classification performance, with higher counts of correct predictions for both positive and negative classes. The distribution of errors is relatively balanced, with moderate rates of both false positives and false negatives. This indicates that the model maintains a consistent ability to distinguish between classes, though some misclassification persists across both outcomes.
Figures
2026-04-03 03:08:49,252 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document
Verify test runs
Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the appended result_ids, which indicate the dataset each relevant test was run with.
Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note here that inputs and input_grid expect the input_id of the dataset or model as the value rather than the variable name we specified:
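As a reference for the expected shape, here is a hypothetical two-entry test_config — the test IDs, input_ids, and params below are placeholders; your actual dictionary reflects the tests the development team ran:

```python
# Hypothetical test_config entries. Keys are test IDs (optionally with a
# `:result_id` suffix); values supply `inputs` or `input_grid` (mapping
# argument names to registered input_ids) plus optional `params`.
test_config = {
    "validmind.data_validation.ClassImbalance:raw_data": {
        "inputs": {"dataset": "raw_dataset"},
        "params": {"min_percent_threshold": 10},
    },
    "my_test_provider.ConfusionMatrix:champion": {
        "input_grid": {"dataset": ["test_dataset"], "model": ["log_model"]},
    },
}
```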
for t in test_config:
    print(t)
    try:
        # Check if the test has an input_grid
        if "input_grid" in test_config[t]:
            # For tests with an input_grid, pass the input_grid configuration
            if "params" in test_config[t]:
                vm.tests.run_test(t, input_grid=test_config[t]["input_grid"], params=test_config[t]["params"]).log()
            else:
                vm.tests.run_test(t, input_grid=test_config[t]["input_grid"]).log()
        else:
            # Original logic for regular inputs
            if "params" in test_config[t]:
                vm.tests.run_test(t, inputs=test_config[t]["inputs"], params=test_config[t]["params"]).log()
            else:
                vm.tests.run_test(t, inputs=test_config[t]["inputs"]).log()
    except Exception as e:
        print(f"Error running test {t}: {str(e)}")
The Dataset Description test provides a comprehensive summary of the dataset columns, including data types, counts, missing values, and the number of distinct values for each feature. The results table presents these statistics for all columns used in the model, covering both numerical and categorical variables. All columns are fully populated, with no missing values, and the distinct value counts are reported for each feature, offering insight into the granularity and diversity of the data.
Key insights:
No missing values across all columns: All 11 columns have 0 missing entries, resulting in 0% missingness throughout the dataset.
High cardinality in select numeric features: The Balance and EstimatedSalary columns exhibit high distinct value counts (5,088 and 8,000 respectively), indicating a wide range of unique values.
Low cardinality in categorical features: Categorical columns such as Geography, Gender, HasCrCard, IsActiveMember, and Exited have between 2 and 3 distinct values, reflecting limited category diversity.
Consistent record count across features: Each column contains 8,000 entries, confirming dataset completeness and alignment across all features.
The dataset is fully complete with no missing data, ensuring robust input coverage for model development and evaluation. Numeric features display varying levels of granularity, with some columns containing a large number of unique values, while categorical features remain low in cardinality. The absence of missing values and the consistent record count across all columns support reliable downstream modeling and analysis.
Tables
Dataset Description
| Name | Type | Count | Missing | Missing % | Distinct | Distinct % |
|---|---|---|---|---|---|---|
| CreditScore | Numeric | 8000.0 | 0 | 0.0 | 452 | 0.0565 |
| Geography | Categorical | 8000.0 | 0 | 0.0 | 3 | 0.0004 |
| Gender | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| Age | Numeric | 8000.0 | 0 | 0.0 | 69 | 0.0086 |
| Tenure | Numeric | 8000.0 | 0 | 0.0 | 11 | 0.0014 |
| Balance | Numeric | 8000.0 | 0 | 0.0 | 5088 | 0.6360 |
| NumOfProducts | Numeric | 8000.0 | 0 | 0.0 | 4 | 0.0005 |
| HasCrCard | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| IsActiveMember | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
| EstimatedSalary | Numeric | 8000.0 | 0 | 0.0 | 8000 | 1.0000 |
| Exited | Categorical | 8000.0 | 0 | 0.0 | 2 | 0.0002 |
2026-04-03 03:08:53,272 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document
The Descriptive Statistics test evaluates the distributional characteristics of both numerical and categorical variables in the dataset. The results present summary statistics for eight numerical variables, including measures of central tendency, dispersion, and range, as well as frequency-based summaries for two categorical variables. The numerical table details count, mean, standard deviation, and percentiles, while the categorical table provides counts, unique value counts, and the dominance of the most frequent category. These results provide a comprehensive overview of the dataset's structure and variable distributions.
Key insights:
Wide range and skewness in balance and salary: The Balance variable has a mean of 76,434 and a median of 97,264, with a minimum of 0 and a maximum of 250,898, indicating a right-skewed distribution. EstimatedSalary also shows a broad range, with a mean of 99,790, a median of 99,505, and a maximum of 199,992.
CreditScore and Age distributions are symmetric: CreditScore and Age have means (650.16 and 38.95, respectively) closely aligned with their medians (652.0 and 37.0), suggesting relatively symmetric distributions.
Binary variables show balanced representation: HasCrCard and IsActiveMember have means of 0.70 and 0.52, respectively, indicating a balanced split between categories.
Categorical dominance in Geography and Gender: France is the most frequent Geography (50.12%), and Male is the most frequent Gender (54.95%), indicating moderate dominance but not extreme concentration.
No missing data detected: All variables report a count of 8,000, indicating complete data coverage for all fields.
The dataset exhibits a mix of symmetric and skewed distributions among numerical variables, with Balance and EstimatedSalary showing substantial right skew and wide value ranges. Categorical variables display moderate dominance of single categories but retain diversity, as evidenced by multiple unique values. The absence of missing data supports data completeness, and the overall distributional characteristics provide a clear foundation for further model analysis and validation.
Tables
Numerical Variables

| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 8000.0 | 650.1596 | 96.8462 | 350.0 | 583.0 | 652.0 | 717.0 | 778.0 | 813.0 | 850.0 |
| Age | 8000.0 | 38.9489 | 10.4590 | 18.0 | 32.0 | 37.0 | 44.0 | 53.0 | 60.0 | 92.0 |
| Tenure | 8000.0 | 5.0339 | 2.8853 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 9.0 | 10.0 |
| Balance | 8000.0 | 76434.0965 | 62612.2513 | 0.0 | 0.0 | 97264.0 | 128045.0 | 149545.0 | 162488.0 | 250898.0 |
| NumOfProducts | 8000.0 | 1.5325 | 0.5805 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 4.0 |
| HasCrCard | 8000.0 | 0.7026 | 0.4571 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 8000.0 | 0.5199 | 0.4996 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 8000.0 | 99790.1880 | 57520.5089 | 12.0 | 50857.0 | 99505.0 | 149216.0 | 179486.0 | 189997.0 | 199992.0 |

Categorical Variables

| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 8000.0 | 3.0 | France | 4010.0 | 50.12 |
| Gender | 8000.0 | 2.0 | Male | 4396.0 | 54.95 |
2026-04-03 03:08:58,493 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document
validmind.data_validation.MissingValues:raw_data
✅ Missing Values Raw Data
The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the 1.0% threshold. All features are listed with their respective missing value statistics and test outcomes.
Key insights:
No missing values detected: All features report zero missing values, with both the number and percentage of missing values recorded as 0.0%.
Universal pass across features: Every feature meets the missing value threshold criterion, resulting in a "Pass" status for all columns.
The dataset demonstrates complete data integrity with respect to missing values, as no feature contains any missing entries. This outcome indicates a high level of data completeness, supporting reliable downstream modeling and analysis.
Parameters:
{
"min_percentage_threshold": 1
}
Tables
| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Age | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-04-03 03:09:00,746 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document
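The missingness gate described above is simple to reproduce independently. A minimal sketch for a single column, not the library's implementation (which operates on whole datasets and handles NaN as well):

```python
def missing_values_check(column, min_percentage_threshold=1):
    """Illustrative missingness gate: count None entries in a column and
    compare the percentage against the threshold."""
    n_missing = sum(1 for v in column if v is None)
    pct = 100 * n_missing / len(column)
    return n_missing, pct, "Pass" if pct < min_percentage_threshold else "Fail"
```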
validmind.data_validation.ClassImbalance:raw_data
✅ Class Imbalance Raw Data
The Class Imbalance test evaluates the distribution of target classes in the dataset to identify potential imbalances that could affect model performance. The results table presents the percentage of records for each class in the target variable "Exited," alongside a pass/fail assessment based on a minimum threshold of 10%. The accompanying bar plot visually displays the proportion of each class, with class 0 and class 1 represented according to their observed frequencies.
Key insights:
Both classes exceed minimum threshold: Class 0 constitutes 79.80% and class 1 constitutes 20.20% of the dataset, with both surpassing the 10% minimum threshold.
No high-risk class imbalance detected: Both classes are marked as "Pass," indicating that neither class is under-represented according to the test criterion.
Majority class dominance observed: Class 0 is the majority class, representing nearly four times the proportion of class 1, as visualized in the bar plot.
The dataset demonstrates sufficient representation for both classes under the defined threshold, with no class flagged for high-risk imbalance. While class 0 is the dominant class, the distribution meets the test's criteria for balanced class representation, supporting reliable model training and evaluation with respect to class frequency.
Parameters:
{
"min_percent_threshold": 10
}
Tables
Exited Class Imbalance

| Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|
| 0 | 79.80% | Pass |
| 1 | 20.20% | Pass |
Figures
2026-04-03 03:09:06,556 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document
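The threshold logic of this check can be sketched independently. This is an illustration of the criterion described above, not the library's implementation:

```python
from collections import Counter


def class_imbalance_check(labels, min_percent_threshold=10):
    """Illustrative class-imbalance gate: percentage of rows per class,
    flagged Fail when a class falls below the minimum threshold."""
    counts = Counter(labels)
    total = len(labels)
    results = {}
    for cls, n in sorted(counts.items()):
        pct = 100 * n / total
        results[cls] = (round(pct, 2), "Pass" if pct >= min_percent_threshold else "Fail")
    return results
```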
validmind.data_validation.Duplicates:raw_data
✅ Duplicates Raw Data
The Duplicates:raw_data test evaluates the presence of duplicate rows in the dataset to ensure data quality and reduce the risk of model overfitting due to redundant information. The results table presents the absolute number and percentage of duplicate rows identified in the dataset, with the test configured to flag results only if the count exceeds a minimum threshold. The table indicates both the number of duplicates and their proportion relative to the total dataset size.
Key insights:
No duplicate rows detected: The dataset contains zero duplicate rows, as indicated by a "Number of Duplicates" value of 0.
Zero percent duplication rate: The "Percentage of Rows (%)" is 0.0%, confirming the absence of redundant entries.
The results demonstrate that the dataset is free from duplicate rows, indicating a high level of data integrity with respect to redundancy. This supports the reliability of subsequent model training and evaluation by minimizing the risk of bias or overfitting associated with duplicate data.
Parameters:
{
"min_threshold": 1
}
Tables
Duplicate Rows Results for Dataset

| Number of Duplicates | Percentage of Rows (%) |
|---|---|
| 0 | 0.0 |
2026-04-03 03:09:09,176 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document
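The duplicate count reported above takes only a few lines to reproduce. This sketch treats each row as a sequence of values (a simplification — the library test operates on DataFrames):

```python
def duplicate_rows_check(rows):
    """Illustrative duplicate-row count: a row counts as a duplicate if an
    identical row appeared earlier in the dataset."""
    seen, dupes = set(), 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes, round(100 * dupes / len(rows), 4)
```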
The High Cardinality test evaluates the number of unique values in categorical columns to identify potential risks of overfitting and data noise. The results table presents the number and percentage of distinct values for each categorical column, along with a pass/fail status based on a threshold of 10% distinct values. Both "Geography" and "Gender" columns are assessed, with their respective distinct value counts and percentages reported.
Key insights:
All categorical columns pass cardinality threshold: Both "Geography" (3 distinct values, 0.0375%) and "Gender" (2 distinct values, 0.025%) are well below the 10% threshold, resulting in a "Pass" status for each.
Low cardinality observed across features: The number of unique values in all evaluated categorical columns remains minimal, indicating limited diversity within these features.
The results indicate that all assessed categorical columns exhibit low cardinality, with distinct value counts and percentages substantially below the defined threshold. No evidence of high cardinality or associated overfitting risk is present in the evaluated features.
2026-04-03 03:09:11,749 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document
validmind.data_validation.Skewness:raw_data
❌ Skewness Raw Data
The Skewness:raw_data test evaluates the asymmetry of numerical feature distributions to identify deviations from normality that may impact model performance. The results table presents skewness values for each numeric column, with a pass/fail indicator based on whether the absolute skewness exceeds the threshold of 1. Skewness values range from -0.89 to 1.48 across the dataset, with most features passing the threshold, while two features exceed it and are flagged as failing.
Key insights:
Majority of features within skewness threshold: Six out of eight numeric features exhibit skewness values between -0.89 and 0.72, all passing the defined threshold of 1.
Age and Exited exceed skewness threshold: The Age feature (skewness = 1.0245) and Exited feature (skewness = 1.4847) both exceed the maximum threshold, resulting in a fail status for these columns.
Minimal skewness in core financial variables: CreditScore, Balance, and EstimatedSalary display skewness values close to zero, indicating near-symmetric distributions.
The skewness assessment reveals that most numeric features maintain distributions close to symmetric, with only Age and Exited exhibiting substantial positive skewness beyond the defined threshold. The results indicate localized asymmetry in these two features, while the remainder of the dataset demonstrates distributional balance.
Parameters:
{
"max_threshold": 1
}
Tables
Skewness Results for Dataset

| Column | Skewness | Pass/Fail |
|---|---|---|
| CreditScore | -0.0620 | Pass |
| Age | 1.0245 | Fail |
| Tenure | 0.0077 | Pass |
| Balance | -0.1353 | Pass |
| NumOfProducts | 0.7172 | Pass |
| HasCrCard | -0.8867 | Pass |
| IsActiveMember | -0.0796 | Pass |
| EstimatedSalary | 0.0095 | Pass |
| Exited | 1.4847 | Fail |
2026-04-03 03:09:14,610 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document
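The pass/fail rule applied above can be sketched from first principles. This uses the uncorrected Fisher-Pearson coefficient; the library computes skewness via pandas (which applies a bias correction), so values may differ slightly:

```python
def skewness_check(values, max_threshold=1):
    """Illustrative skewness gate: uncorrected Fisher-Pearson coefficient
    compared against an absolute-value threshold."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    skew = m3 / m2 ** 1.5
    return skew, "Pass" if abs(skew) <= max_threshold else "Fail"
```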
validmind.data_validation.UniqueRows:raw_data
❌ Unique Rows Raw Data
The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column relative to the total row count, with a minimum threshold set at 1%. The results table presents, for each column, the number and percentage of unique values, along with a pass/fail outcome based on whether the uniqueness percentage meets or exceeds the threshold. Columns such as EstimatedSalary, Balance, and CreditScore exceed the threshold and pass, while most categorical and low-cardinality columns fall below the threshold and fail.
Key insights:
High uniqueness in continuous variables: EstimatedSalary (100%), Balance (63.6%), and CreditScore (5.65%) all exceed the 1% uniqueness threshold, resulting in a pass for these columns.
Low uniqueness in categorical variables: Columns such as Geography (0.0375%), Gender (0.025%), HasCrCard (0.025%), IsActiveMember (0.025%), and Exited (0.025%) have very low percentages of unique values and fail the test.
Age and Tenure show moderate to low uniqueness: Age (0.8625%) and Tenure (0.1375%) both fall below the threshold, resulting in a fail despite being numeric variables.
Majority of columns fail uniqueness threshold: Only 3 out of 11 columns meet the minimum uniqueness requirement, with the remaining 8 columns failing.
The results indicate that while continuous variables such as EstimatedSalary, Balance, and CreditScore demonstrate sufficient diversity, the majority of columns—particularly those representing categorical or low-cardinality features—do not meet the minimum uniqueness threshold. This pattern reflects the inherent structure of the dataset, where categorical variables naturally exhibit limited unique values. The overall data composition is characterized by high uniqueness in select continuous features and low uniqueness in most categorical and discrete variables.
Parameters:
{
"min_percent_threshold": 1
}
Tables
| Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 452 | 5.6500 | Pass |
| Geography | 3 | 0.0375 | Fail |
| Gender | 2 | 0.0250 | Fail |
| Age | 69 | 0.8625 | Fail |
| Tenure | 11 | 0.1375 | Fail |
| Balance | 5088 | 63.6000 | Pass |
| NumOfProducts | 4 | 0.0500 | Fail |
| HasCrCard | 2 | 0.0250 | Fail |
| IsActiveMember | 2 | 0.0250 | Fail |
| EstimatedSalary | 8000 | 100.0000 | Pass |
| Exited | 2 | 0.0250 | Fail |
2026-04-03 03:09:19,202 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document
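The per-column uniqueness criterion described above reduces to a one-line ratio. A sketch of that logic for a single column, not the library's implementation:

```python
def unique_rows_check(column, min_percent_threshold=1):
    """Illustrative uniqueness gate: percentage of distinct values relative
    to the row count, compared against a minimum threshold."""
    pct = 100 * len(set(column)) / len(column)
    return round(pct, 4), "Pass" if pct >= min_percent_threshold else "Fail"
```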
The TooManyZeroValues test identifies numerical columns with a proportion of zero values exceeding a defined threshold, set here at 0.03%. The results table summarizes the number and percentage of zero values for each numerical variable, along with a pass/fail status based on the threshold. All four evaluated variables—Tenure, Balance, HasCrCard, and IsActiveMember—are reported with their respective row counts and zero value statistics.
Key insights:
All variables exceed zero value threshold: Each of the four numerical columns tested has a percentage of zero values significantly above the 0.03% threshold, resulting in a fail status for all.
High zero prevalence in IsActiveMember and Balance: IsActiveMember has the highest proportion of zero values at 48.01%, followed by Balance at 36.4%, indicating substantial sparsity in these features.
Substantial zero values in HasCrCard and Tenure: HasCrCard and Tenure also display elevated zero value rates at 29.74% and 4.04%, respectively, both well above the threshold.
The test results indicate that all assessed numerical variables contain a high proportion of zero values relative to the defined threshold. This widespread presence of zeros may reflect underlying data sparsity or limited variation in these features. The findings highlight the need for careful consideration of these variables in subsequent modeling steps, as their distributional characteristics could influence model performance and interpretability.
Parameters:
{
"max_percent_threshold": 0.03
}
Tables
| Variable | Row Count | Number of Zero Values | Percentage of Zero Values (%) | Pass/Fail |
|---|---|---|---|---|
| Tenure | 8000 | 323 | 4.0375 | Fail |
| Balance | 8000 | 2912 | 36.4000 | Fail |
| HasCrCard | 8000 | 2379 | 29.7375 | Fail |
| IsActiveMember | 8000 | 3841 | 48.0125 | Fail |
2026-04-03 03:09:22,376 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document
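The zero-prevalence rule above can be sketched as follows. Note one assumption: max_percent_threshold is compared here as a percentage, matching the narrative description of a 0.03% threshold; verify how the library interprets the parameter before relying on this:

```python
def zero_values_check(values, max_percent_threshold=0.03):
    """Illustrative zero-prevalence gate; the threshold is treated as a
    percentage here (an assumption based on the report text)."""
    pct = 100 * sum(1 for v in values if v == 0) / len(values)
    return round(pct, 4), "Pass" if pct <= max_percent_threshold else "Fail"
```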
The Interquartile Range Outliers Table (IQROutliersTable) test identifies and summarizes outliers in numerical features using the IQR method, with a threshold parameter set to 5. The results are presented in a summary table that would list the number and distribution of outliers for each numerical feature. In this instance, the result table is empty, indicating no outliers were detected in any numerical feature according to the specified threshold.
Key insights:
No outliers detected in numerical features: The summary table contains no entries, indicating that none of the numerical features in the dataset have values classified as outliers under the IQR method with a threshold of 5.
Uniform data distribution across features: The absence of outliers suggests that all numerical features fall within the expected range defined by the IQR criteria, with no extreme values present.
The results indicate that the dataset's numerical features exhibit a uniform distribution without extreme values exceeding the IQR-based outlier threshold. This suggests a high degree of data consistency and absence of anomalous values under the current test configuration.
Parameters:
{
"threshold": 5
}
Tables
Summary of Outliers Detected by IQR Method
2026-04-03 03:09:25,335 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document
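The IQR criterion described above is easy to reproduce. This sketch uses linearly interpolated quartiles (NumPy's default method — an assumption; the library's quantile method may differ slightly):

```python
def iqr_outliers(values, threshold=5):
    """Illustrative IQR outlier detection: values outside
    [Q1 - threshold*IQR, Q3 + threshold*IQR] are flagged."""
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics
        idx = q * (len(s) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (idx - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return [x for x in values if x < lower or x > upper]
```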
The Descriptive Statistics test evaluates the distributional characteristics of both numerical and categorical variables in the preprocessed dataset. The results present summary statistics for seven numerical variables, including measures of central tendency, dispersion, and range, as well as frequency-based summaries for two categorical variables. The tables provide a comprehensive overview of the dataset’s structure, highlighting the spread, central values, and dominant categories for each variable.
Key insights:
Wide range and skewness in Balance: The Balance variable exhibits a minimum of 0.0, a median of 103,293.0, and a maximum of 250,898.0, with a mean (81,726.2) notably lower than the median, indicating right-skewness and a substantial proportion of zero balances.
CreditScore distribution is symmetric and complete: CreditScore shows a mean (648.3) closely aligned with the median (650.0), with values spanning from 350.0 to 850.0 and no missing data.
Binary variables show balanced representation: HasCrCard and IsActiveMember are binary, with HasCrCard having 70.2% of entries as 1 and IsActiveMember at 46.3%, indicating no extreme imbalance.
Categorical dominance in Geography and Gender: France is the most frequent Geography (46.35%), and Male is the most frequent Gender (50.8%), with both variables showing moderate diversity (three and two unique values, respectively).
No missing data detected: All variables report a count of 3,232, matching the dataset size, indicating complete data coverage.
The dataset demonstrates comprehensive coverage and completeness across all variables, with numerical features generally exhibiting expected ranges and central tendencies. The Balance variable displays pronounced right-skewness and a high proportion of zero values, while categorical variables show moderate diversity with some category dominance. Binary variables are well-represented without significant imbalance. Overall, the data structure supports robust analysis, with key distributional characteristics clearly delineated.
Tables
Numerical Variables

| Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 3232.0 | 648.2732 | 98.4944 | 350.0 | 580.0 | 650.0 | 715.0 | 778.0 | 815.0 | 850.0 |
| Tenure | 3232.0 | 5.0059 | 2.8881 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| Balance | 3232.0 | 81726.2286 | 61509.2108 | 0.0 | 0.0 | 103293.0 | 129067.0 | 150193.0 | 163649.0 | 250898.0 |
| NumOfProducts | 3232.0 | 1.5096 | 0.6712 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| HasCrCard | 3232.0 | 0.7024 | 0.4573 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 3232.0 | 0.4632 | 0.4987 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 3232.0 | 100564.6734 | 57273.4954 | 12.0 | 52422.0 | 100935.0 | 149483.0 | 179264.0 | 189999.0 | 199953.0 |

Categorical Variables

| Name | Count | Number of Unique Values | Top Value | Top Value Frequency | Top Value Frequency % |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | France | 1498.0 | 46.35 |
| Gender | 3232.0 | 2.0 | Male | 1642.0 | 50.80 |
2026-04-03 03:09:29,707 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document
The Tabular Description test evaluates the distributional characteristics, completeness, and data types of numerical and categorical variables in the preprocessed dataset. The results present summary statistics for each variable, including measures of central tendency, range, missingness, and unique value counts. All variables are reported with their respective data types and observed value ranges, providing a comprehensive overview of the dataset's structure and integrity.
Key insights:
No missing values detected: All numerical and categorical variables report 0.0% missing values, indicating complete data coverage across the dataset.
Consistent data types across variables: Numerical variables are represented as int64 or float64, while categorical variables are of object type, aligning with standard data representations.
Balanced binary target variable: The 'Exited' variable has a mean of 0.5, with minimum and maximum values of 0 and 1, indicating an even split between classes.
Limited categorical diversity: 'Geography' contains three unique values (Germany, France, Spain), and 'Gender' contains two unique values (Female, Male), reflecting low cardinality in categorical features.
Wide range in numerical features: Variables such as 'CreditScore', 'Balance', and 'EstimatedSalary' exhibit broad value ranges, with 'CreditScore' spanning from 350 to 850 and 'EstimatedSalary' from 11.58 to 199,953.33.
The dataset demonstrates high data integrity, with complete records and appropriate data types for all variables. The balanced distribution of the target variable and the limited number of unique values in categorical features provide a clear structure for downstream modeling. The observed value ranges in numerical variables indicate sufficient variability for predictive modeling, while the absence of missing data reduces the risk of data quality issues.
Tables
| Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|
| CreditScore | 3232 | 648.2732 | 350.00 | 850.00 | 0.0 | int64 |
| Tenure | 3232 | 5.0059 | 0.00 | 10.00 | 0.0 | int64 |
| Balance | 3232 | 81726.2286 | 0.00 | 250898.09 | 0.0 | float64 |
| NumOfProducts | 3232 | 1.5096 | 1.00 | 4.00 | 0.0 | int64 |
| HasCrCard | 3232 | 0.7024 | 0.00 | 1.00 | 0.0 | int64 |
| IsActiveMember | 3232 | 0.4632 | 0.00 | 1.00 | 0.0 | int64 |
| EstimatedSalary | 3232 | 100564.6734 | 11.58 | 199953.33 | 0.0 | float64 |
| Exited | 3232 | 0.5000 | 0.00 | 1.00 | 0.0 | int64 |

| Categorical Variable | Num of Obs | Num of Unique Values | Unique Values | Missing Values (%) | Data Type |
|---|---|---|---|---|---|
| Geography | 3232.0 | 3.0 | ['Germany' 'France' 'Spain'] | 0.0 | object |
| Gender | 3232.0 | 2.0 | ['Female' 'Male'] | 0.0 | object |
2026-04-03 03:09:33,270 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document
The Missing Values test evaluates dataset quality by measuring the proportion of missing values in each feature and comparing it to a predefined threshold. The results table presents, for each column, the number and percentage of missing values, along with a Pass/Fail status based on whether the missingness exceeds the 1.0% threshold. All features in the preprocessed dataset are included in the assessment, with missingness percentages and test outcomes clearly indicated.
Key insights:
No missing values detected: All features, including CreditScore, Geography, Gender, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, have 0 missing values.
All features pass threshold criteria: Each feature registers a missing value percentage of 0.0%, resulting in a Pass status for all columns under the 1.0% threshold.
The dataset demonstrates complete data integrity with respect to missing values, as all features contain 0% missingness and meet the established quality threshold. This indicates a high level of data completeness in the preprocessed dataset, supporting reliable downstream modeling and analysis.
Parameters:
{
"min_percentage_threshold": 1
}
Tables
| Column | Number of Missing Values | Percentage of Missing Values (%) | Pass/Fail |
|---|---|---|---|
| CreditScore | 0 | 0.0 | Pass |
| Geography | 0 | 0.0 | Pass |
| Gender | 0 | 0.0 | Pass |
| Tenure | 0 | 0.0 | Pass |
| Balance | 0 | 0.0 | Pass |
| NumOfProducts | 0 | 0.0 | Pass |
| HasCrCard | 0 | 0.0 | Pass |
| IsActiveMember | 0 | 0.0 | Pass |
| EstimatedSalary | 0 | 0.0 | Pass |
| Exited | 0 | 0.0 | Pass |
2026-04-03 03:09:36,943 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document
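The missingness check above can be reproduced outside the platform as a quick sanity check. The sketch below mirrors the test's table with plain pandas; the function name and toy frame are illustrative, and treating missingness strictly below the threshold as a pass is an assumption here, not confirmed ValidMind behavior:

```python
import pandas as pd

def missing_values_check(df: pd.DataFrame, min_percentage_threshold: float = 1.0) -> pd.DataFrame:
    """Per-column missing counts and percentages, with a Pass/Fail flag
    against the given percentage threshold."""
    n = len(df)
    rows = []
    for col in df.columns:
        n_missing = int(df[col].isna().sum())
        pct = 100.0 * n_missing / n
        rows.append({
            "Column": col,
            "Number of Missing Values": n_missing,
            "Percentage of Missing Values (%)": round(pct, 4),
            "Pass/Fail": "Pass" if pct < min_percentage_threshold else "Fail",
        })
    return pd.DataFrame(rows)

# Toy frame with one missing CreditScore (not the actual bank dataset)
toy = pd.DataFrame({"CreditScore": [650, 720, None], "Tenure": [5, 3, 8]})
print(missing_values_check(toy))
```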
The TabularNumericalHistograms:preprocessed_data test provides a visual assessment of the distribution of each numerical feature in the dataset by generating histograms. The resulting plots display the frequency distribution for each variable, enabling identification of skewness, outliers, and other distributional characteristics. These visualizations facilitate an understanding of the underlying data structure prior to model training or evaluation.
Key insights:
CreditScore distribution is unimodal and right-skewed: The CreditScore histogram shows a single peak between 600 and 700, with a longer tail extending toward higher values, indicating right skewness and a concentration of scores in the mid-to-high range.
Tenure is uniformly distributed with edge effects: The Tenure variable displays a near-uniform distribution across most values, with lower frequencies at the minimum and maximum ends, suggesting even representation except at the boundaries.
Balance exhibits a strong zero-inflation and central peak: The Balance histogram reveals a large spike at zero, followed by a bell-shaped distribution centered around 120,000, indicating a substantial proportion of zero balances and a concentrated nonzero range.
NumOfProducts is highly concentrated at lower values: The NumOfProducts feature is dominated by the value 1, with rapidly decreasing frequencies for higher product counts, indicating most customers hold a single product.
HasCrCard and IsActiveMember are binary with class imbalance: Both HasCrCard and IsActiveMember show binary distributions, with HasCrCard skewed toward 1 and IsActiveMember more evenly split but still showing a higher count for 0.
EstimatedSalary is approximately uniform: The EstimatedSalary histogram is relatively flat across the range, indicating an even distribution of salary values without pronounced skewness or clustering.
The histograms reveal a range of distributional patterns across numerical features, including right skewness, zero-inflation, and class imbalances. These characteristics highlight the presence of concentrated values and potential outliers in several variables, which may influence model behavior and warrant consideration in subsequent analysis. The visualizations provide a comprehensive overview of input data structure, supporting further assessment of data quality and suitability for modeling.
Figures
2026-04-03 03:09:44,712 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document
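The zero-inflation pattern described for Balance can be confirmed numerically: `np.histogram` returns the bin counts behind each histogram panel. The synthetic values below are illustrative, not the bank dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
# 300 zero balances plus 700 nonzero balances drawn uniformly above 50,000
balance = np.concatenate([np.zeros(300), rng.uniform(50_000, 250_000, 700)])

# Bin counts underlying a 10-bin histogram; the first bin holds the zero spike
counts, edges = np.histogram(balance, bins=10)
print(counts)
```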
The TabularCategoricalBarPlots:preprocessed_data test evaluates the distribution of categorical variables in the dataset by generating bar plots for each category within these features. The resulting plots display the frequency counts for each category in the "Geography" and "Gender" variables, providing a visual summary of the dataset's categorical composition. This enables assessment of category balance and identification of any potential imbalances or underrepresented groups.
Key insights:
Geography distribution is imbalanced: The "Geography" variable shows France as the most represented category, followed by Germany and then Spain, with France having a notably higher count than the other two.
Gender distribution is balanced: The "Gender" variable displays similar counts for both "Male" and "Female" categories, indicating no significant imbalance between genders.
The categorical composition of the dataset reveals a pronounced imbalance in the "Geography" variable, with France being overrepresented relative to Germany and Spain. In contrast, the "Gender" variable demonstrates a balanced distribution across categories. These observations provide a clear view of the dataset's categorical structure and highlight areas where category representation may influence model behavior.
Figures
2026-04-03 03:09:50,005 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document
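The bar heights in plots like these are simply category counts, which pandas' `value_counts` gives directly. The toy series below is illustrative:

```python
import pandas as pd

# Toy Geography column (not the actual dataset)
geography = pd.Series(["France", "France", "Germany", "Spain", "France", "Germany"])

# Frequency of each category, i.e. the bar heights of a categorical bar plot
counts = geography.value_counts()
print(counts)
```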
The Target Rate Bar Plots test visualizes the distribution and target rates of categorical features to provide insight into the model’s classification patterns. The results display paired bar plots for each categorical variable, with the left plot showing the frequency of each category and the right plot illustrating the mean target rate for each category as derived from the default column. The features analyzed include Geography and Gender, with each category’s sample count and corresponding target rate presented side by side.
Key insights:
Distinct target rate variation by Geography: The target rate for Germany is notably higher than for France and Spain, with Germany exceeding 0.6 while France and Spain are both near 0.43.
Balanced sample counts across Gender: Male and Female categories have nearly equal representation, each with counts above 1,500.
Gender-based target rate difference: The target rate for Female is higher (above 0.5) compared to Male (below 0.45), indicating a measurable difference in positive class proportion between genders.
Uneven category representation in Geography: France has the highest sample count, followed by Germany and then Spain, with Spain having the lowest representation.
The results reveal that both Geography and Gender exhibit substantial differences in target rates across their respective categories, with Germany and Female categories showing elevated target rates relative to their peers. Sample counts are well balanced for Gender but show more pronounced variation for Geography. These patterns highlight areas where model predictions and underlying data distributions diverge across categorical groups.
Figures
2026-04-03 03:09:55,678 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
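The paired plots correspond to a per-category sample count and mean of the target, which is a single `groupby` in pandas. The small frame below is hypothetical, standing in for the preprocessed data:

```python
import pandas as pd

df = pd.DataFrame({
    "Geography": ["France", "Germany", "France", "Spain", "Germany"],
    "Exited":    [0,        1,         1,        0,       1],
})

# Left panel = count per category; right panel = mean target rate per category
summary = df.groupby("Geography")["Exited"].agg(count="count", target_rate="mean")
print(summary)
```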
The Descriptive Statistics test evaluates the distributional characteristics of numerical variables in the development (train) and test datasets. The results present summary statistics including count, mean, standard deviation, minimum, percentiles, and maximum for each variable. These statistics provide a quantitative overview of the central tendency, dispersion, and range for each feature, enabling assessment of data quality and potential risk factors.
Key insights:
Consistent central tendencies across datasets: Means and medians for key variables such as CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary are closely aligned between the train and test datasets, indicating stable data distributions.
Wide range and high variance in Balance and EstimatedSalary: Both Balance and EstimatedSalary exhibit large standard deviations relative to their means, with Balance ranging from 0 to over 222,000 in train and up to 250,898 in test, and EstimatedSalary spanning from low hundreds to nearly 200,000.
Binary variables show expected distributions: HasCrCard and IsActiveMember display means near 0.7 and 0.46, respectively, with medians matching the most frequent values, reflecting their binary nature and balanced representation.
No missing data detected: All variables report counts equal to the total number of records in their respective datasets, indicating complete data coverage for the analyzed features.
The descriptive statistics indicate that the numerical variables in both the train and test datasets are well-aligned, with similar central tendencies and dispersion measures. High variance in Balance and EstimatedSalary reflects substantial heterogeneity in these financial attributes, while binary variables maintain balanced distributions. The absence of missing data supports data integrity for subsequent modeling and analysis.
Tables
| dataset | Name | Count | Mean | Std | Min | 25% | 50% | 75% | 90% | 95% | Max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585.0 | 649.2371 | 99.7791 | 350.0 | 581.0 | 650.0 | 718.0 | 781.0 | 820.0 | 850.0 |
| train_dataset_final | Tenure | 2585.0 | 5.0193 | 2.8830 | 0.0 | 3.0 | 5.0 | 8.0 | 9.0 | 10.0 | 10.0 |
| train_dataset_final | Balance | 2585.0 | 81144.9986 | 61547.3226 | 0.0 | 0.0 | 102967.0 | 128915.0 | 150078.0 | 163147.0 | 222268.0 |
| train_dataset_final | NumOfProducts | 2585.0 | 1.5157 | 0.6641 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| train_dataset_final | HasCrCard | 2585.0 | 0.7041 | 0.4566 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | IsActiveMember | 2585.0 | 0.4654 | 0.4989 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| train_dataset_final | EstimatedSalary | 2585.0 | 101141.6525 | 57294.4854 | 12.0 | 52991.0 | 101070.0 | 150803.0 | 179445.0 | 189790.0 | 199808.0 |
| test_dataset_final | CreditScore | 647.0 | 644.4219 | 93.1597 | 350.0 | 578.0 | 647.0 | 710.0 | 764.0 | 795.0 | 850.0 |
| test_dataset_final | Tenure | 647.0 | 4.9521 | 2.9103 | 0.0 | 2.0 | 5.0 | 7.0 | 9.0 | 10.0 | 10.0 |
| test_dataset_final | Balance | 647.0 | 84048.4535 | 61349.2222 | 0.0 | 0.0 | 104414.0 | 129786.0 | 150681.0 | 165412.0 | 250898.0 |
| test_dataset_final | NumOfProducts | 647.0 | 1.4853 | 0.6990 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| test_dataset_final | HasCrCard | 647.0 | 0.6955 | 0.4605 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | IsActiveMember | 647.0 | 0.4544 | 0.4983 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| test_dataset_final | EstimatedSalary | 647.0 | 98259.4324 | 57175.6715 | 123.0 | 51162.0 | 99922.0 | 145913.0 | 177889.0 | 190674.0 | 199953.0 |
2026-04-03 03:10:00,845 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
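Summary statistics with this percentile set come directly from pandas' `describe`. A minimal sketch on synthetic data (illustrative, not the datasets above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"Balance": rng.uniform(0, 250_000, 100)})

# describe() with the same percentile set used in the table above
stats = df["Balance"].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95])
print(stats)
```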
The Descriptive Statistics test evaluates the distributional characteristics and completeness of numerical variables in the development and test datasets. The results present summary statistics, including mean, minimum, maximum, and missingness percentage, for each variable across both datasets. All variables are reported with their respective data types and observation counts, providing a comprehensive overview of the dataset structure and integrity.
Key insights:
No missing values detected: All numerical variables in both the development and test datasets have 0.0% missing values, indicating complete data coverage.
Consistent variable ranges across datasets: Minimum and maximum values for variables such as CreditScore, Tenure, NumOfProducts, HasCrCard, IsActiveMember, and Exited are identical between development and test datasets.
Stable central tendencies: Mean values for all variables are closely aligned between the development and test datasets, with differences remaining minimal (e.g., CreditScore mean differs by 4.8 points, EstimatedSalary by approximately 2,882 units).
Data types are appropriate and consistent: All variables are typed as either int64 or float64, matching their expected numerical nature.
The descriptive statistics indicate a high degree of data completeness and consistency between the development and test datasets. Variable distributions, central tendencies, and data types are stable and well-aligned, supporting reliable downstream modeling and analysis. No data quality issues or anomalies are observed in the reported statistics.
Tables
| dataset | Numerical Variable | Num of Obs | Mean | Min | Max | Missing Values (%) | Data Type |
|---|---|---|---|---|---|---|---|
| train_dataset_final | CreditScore | 2585 | 649.2371 | 350.00 | 850.00 | 0.0 | int64 |
| train_dataset_final | Tenure | 2585 | 5.0193 | 0.00 | 10.00 | 0.0 | int64 |
| train_dataset_final | Balance | 2585 | 81144.9986 | 0.00 | 222267.63 | 0.0 | float64 |
| train_dataset_final | NumOfProducts | 2585 | 1.5157 | 1.00 | 4.00 | 0.0 | int64 |
| train_dataset_final | HasCrCard | 2585 | 0.7041 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | IsActiveMember | 2585 | 0.4654 | 0.00 | 1.00 | 0.0 | int64 |
| train_dataset_final | EstimatedSalary | 2585 | 101141.6525 | 11.58 | 199808.10 | 0.0 | float64 |
| train_dataset_final | Exited | 2585 | 0.5002 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | CreditScore | 647 | 644.4219 | 350.00 | 850.00 | 0.0 | int64 |
| test_dataset_final | Tenure | 647 | 4.9521 | 0.00 | 10.00 | 0.0 | int64 |
| test_dataset_final | Balance | 647 | 84048.4535 | 0.00 | 250898.09 | 0.0 | float64 |
| test_dataset_final | NumOfProducts | 647 | 1.4853 | 1.00 | 4.00 | 0.0 | int64 |
| test_dataset_final | HasCrCard | 647 | 0.6955 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | IsActiveMember | 647 | 0.4544 | 0.00 | 1.00 | 0.0 | int64 |
| test_dataset_final | EstimatedSalary | 647 | 98259.4324 | 123.07 | 199953.33 | 0.0 | float64 |
| test_dataset_final | Exited | 647 | 0.4992 | 0.00 | 1.00 | 0.0 | int64 |
2026-04-03 03:10:04,978 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document
The Class Imbalance test evaluates the distribution of target classes within the training and test datasets to identify potential imbalances that could impact model performance. The results present the percentage representation of each class in both datasets, with a minimum threshold of 10% set for each class to pass. Bar plots visualize the proportion of each class, supporting interpretation of class balance.
Key insights:
Near-equal class distribution in all datasets: Both the training and test datasets show class proportions close to 50% for each class (Exited = 0 and Exited = 1), with values ranging from 49.92% to 50.08%.
All classes exceed minimum threshold: Each class in both datasets surpasses the 10% minimum percentage threshold, resulting in a "Pass" outcome for all evaluated classes.
Consistent class balance across splits: The class distribution remains stable between the training and test datasets, indicating no shift in class proportions between data splits.
The results demonstrate that the target variable is evenly distributed across both the training and test datasets, with no evidence of class imbalance. All classes meet the predefined minimum representation threshold, supporting the suitability of the data for unbiased model training and evaluation.
Parameters:
{
"min_percent_threshold": 10
}
Tables
| dataset | Exited | Percentage of Rows (%) | Pass/Fail |
|---|---|---|---|
| train_dataset_final | 1 | 50.02% | Pass |
| train_dataset_final | 0 | 49.98% | Pass |
| test_dataset_final | 0 | 50.08% | Pass |
| test_dataset_final | 1 | 49.92% | Pass |
Figures
2026-04-03 03:10:08,942 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
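The class-balance computation reduces to normalized value counts compared against the minimum percentage threshold. The helper and labels below are illustrative, not the ValidMind implementation:

```python
import pandas as pd

def class_imbalance_check(y: pd.Series, min_percent_threshold: float = 10.0) -> pd.DataFrame:
    """Percentage of rows per class, with a Pass/Fail flag against the
    minimum representation threshold."""
    pct = y.value_counts(normalize=True) * 100
    return pd.DataFrame({
        "Percentage of Rows (%)": pct.round(2),
        "Pass/Fail": ["Pass" if p >= min_percent_threshold else "Fail" for p in pct],
    })

# Toy target: 48% / 52% split, comfortably above a 10% threshold
y = pd.Series([0] * 48 + [1] * 52)
print(class_imbalance_check(y))
```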
The UniqueRows test evaluates the diversity of the dataset by measuring the proportion of unique values in each column and comparing it to a minimum percentage threshold. The results table presents, for both the training and test datasets, the number and percentage of unique values per column, along with a pass/fail outcome based on whether the percentage exceeds the 1% threshold. Columns with a high proportion of unique values are marked as passing, while those with lower diversity are marked as failing.
Key insights:
High uniqueness in continuous variables: Columns such as EstimatedSalary and Balance exhibit high percentages of unique values (up to 100% in both datasets), consistently passing the uniqueness threshold.
Low uniqueness in categorical variables: Columns representing categorical or binary features (e.g., HasCrCard, IsActiveMember, Geography_Germany, Gender_Male, Exited) show very low percentages of unique values (all below 1%), resulting in a fail outcome.
Mixed results for ordinal variables: Tenure passes the threshold in the test dataset (1.70%) but fails in the training dataset (0.43%), indicating variability in uniqueness across splits.
Consistent patterns across datasets: The same columns tend to pass or fail in both the training and test datasets, reflecting stable data structure and encoding practices.
The results indicate that continuous variables in the dataset provide substantial diversity, while categorical and binary variables inherently display low uniqueness and do not meet the prescribed threshold. This pattern is consistent across both training and test datasets, highlighting the influence of variable type on uniqueness outcomes. The test effectively distinguishes between variable types in terms of data diversity, with no evidence of unexpected duplication or lack of variety in continuous features.
Parameters:
{
"min_percent_threshold": 1
}
Tables
| dataset | Column | Number of Unique Values | Percentage of Unique Values (%) | Pass/Fail |
|---|---|---|---|---|
| train_dataset_final | CreditScore | 426 | 16.4797 | Pass |
| train_dataset_final | Tenure | 11 | 0.4255 | Fail |
| train_dataset_final | Balance | 1742 | 67.3888 | Pass |
| train_dataset_final | NumOfProducts | 4 | 0.1547 | Fail |
| train_dataset_final | HasCrCard | 2 | 0.0774 | Fail |
| train_dataset_final | IsActiveMember | 2 | 0.0774 | Fail |
| train_dataset_final | EstimatedSalary | 2585 | 100.0000 | Pass |
| train_dataset_final | Geography_Germany | 2 | 0.0774 | Fail |
| train_dataset_final | Geography_Spain | 2 | 0.0774 | Fail |
| train_dataset_final | Gender_Male | 2 | 0.0774 | Fail |
| train_dataset_final | Exited | 2 | 0.0774 | Fail |
| test_dataset_final | CreditScore | 294 | 45.4405 | Pass |
| test_dataset_final | Tenure | 11 | 1.7002 | Pass |
| test_dataset_final | Balance | 453 | 70.0155 | Pass |
| test_dataset_final | NumOfProducts | 4 | 0.6182 | Fail |
| test_dataset_final | HasCrCard | 2 | 0.3091 | Fail |
| test_dataset_final | IsActiveMember | 2 | 0.3091 | Fail |
| test_dataset_final | EstimatedSalary | 647 | 100.0000 | Pass |
| test_dataset_final | Geography_Germany | 2 | 0.3091 | Fail |
| test_dataset_final | Geography_Spain | 2 | 0.3091 | Fail |
| test_dataset_final | Gender_Male | 2 | 0.3091 | Fail |
| test_dataset_final | Exited | 2 | 0.3091 | Fail |
2026-04-03 03:10:15,109 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
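The uniqueness check can be sketched per column with `nunique()`. The helper and frame below are illustrative; whether the threshold comparison is strict is an assumption here, not confirmed ValidMind behavior:

```python
import pandas as pd

def unique_rows_check(df: pd.DataFrame, min_percent_threshold: float = 1.0) -> pd.DataFrame:
    """Per-column unique-value counts and percentages, flagged against
    the minimum uniqueness threshold."""
    n = len(df)
    rows = []
    for col in df.columns:
        n_unique = df[col].nunique()
        pct = 100.0 * n_unique / n
        rows.append({
            "Column": col,
            "Number of Unique Values": n_unique,
            "Percentage of Unique Values (%)": round(pct, 4),
            "Pass/Fail": "Pass" if pct > min_percent_threshold else "Fail",
        })
    return pd.DataFrame(rows)

# Toy frame: a continuous-like column passes, a binary column fails
df = pd.DataFrame({"Balance": range(200), "HasCrCard": [0, 1] * 100})
print(unique_rows_check(df))
```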
The Tabular Numerical Histograms test provides a visual assessment of the distribution of each numerical feature in both the development (train) and test datasets. The resulting histograms display the frequency distribution for each variable, enabling identification of skewness, outliers, and concentration patterns. This visualization supports the evaluation of data quality and the detection of potential distributional issues that may impact model performance.
Key insights:
CreditScore displays moderate right skew: Both train and test datasets show a concentration of values between 550 and 750, with a tail extending toward higher scores and a small number of extreme values above 800.
Tenure is nearly uniform with edge effects: Tenure values are distributed relatively evenly across most bins, with lower frequencies at the minimum and maximum values in both datasets.
Balance exhibits strong zero-inflation: A substantial proportion of records have a balance of zero, with the remainder forming a roughly symmetric distribution centered around 120,000.
NumOfProducts is highly concentrated at lower values: The majority of records have one or two products, with very few instances at three or four products.
HasCrCard and IsActiveMember are binary and imbalanced: Most records have a credit card (HasCrCard = 1) and a slight majority are not active members (IsActiveMember = 0), with similar patterns in both datasets.
EstimatedSalary is uniformly distributed: The salary variable shows a flat distribution across its range, indicating no significant skew or clustering.
Geography and Gender features show categorical imbalances: Geography_Germany and Geography_Spain are both more frequently false than true, while Gender_Male is nearly balanced between true and false.
The histograms reveal that most numerical features maintain consistent distributional patterns between the train and test datasets, with no evidence of major distributional drift. Several features, such as Balance and NumOfProducts, display strong concentration at specific values, while CreditScore and EstimatedSalary exhibit broader, more continuous distributions. The presence of zero-inflation in Balance and categorical imbalances in certain features are notable characteristics of the dataset. Overall, the input data distributions are well-characterized, supporting further model analysis and validation.
Figures
2026-04-03 03:10:24,046 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document
The Mutual Information test evaluates the statistical dependency between each feature and the target variable to quantify feature relevance for model training. The results are presented as bar plots of mutual information scores for both the training and test datasets, with a threshold line at 0.01 indicating the minimum relevance level. Scores are normalized between 0 and 1, allowing for direct comparison of feature importance across the input variables.
Key insights:
NumOfProducts consistently highest relevance: NumOfProducts exhibits the highest mutual information score in both training and test datasets, with values near 0.10 and 0.11, respectively, indicating a strong relationship with the target variable.
Limited high-relevance features: Only a small subset of features (NumOfProducts, Gender_Male, IsActiveMember in training; NumOfProducts, IsActiveMember, CreditScore in test) exceed the 0.01 threshold, while most features register scores at or below this level.
Stable feature ranking across datasets: The relative ranking of feature importance remains consistent between training and test datasets, with the same features generally appearing above the threshold in both splits.
Majority of features show low information content: Several features, including HasCrCard, EstimatedSalary, and Geography_Spain, display mutual information scores at or near zero in both datasets, indicating minimal direct association with the target.
The mutual information analysis reveals that only a limited number of features demonstrate substantial relevance to the target variable, with NumOfProducts consistently dominating in both training and test datasets. The majority of features exhibit low or negligible mutual information scores, suggesting limited direct predictive value. The observed stability in feature ranking across data splits supports the robustness of these findings.
Parameters:
{
"min_threshold": 0.01
}
Figures
2026-04-03 03:10:32,105 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
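Scores of this kind can be computed with scikit-learn's `mutual_info_classif`. The sketch below uses synthetic features, one closely tracking the target and one pure noise, to show the contrast the bar plots visualize; the data are illustrative, not the bank dataset:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
n = 500
y = rng.integers(0, 2, n)                      # binary target
strong = y + rng.normal(scale=0.1, size=n)     # feature closely tracking the target
noise = rng.normal(size=n)                     # feature unrelated to the target

X = np.column_stack([strong, noise])
scores = mutual_info_classif(X, y, random_state=42)
print(scores)  # the informative feature should score far above the noise
```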
The Pearson Correlation Matrix test evaluates the linear relationships between all pairs of numerical variables in the dataset, providing a heat map visualization of the correlation coefficients. The results display the correlation structure for both the training and test datasets, with coefficients ranging from -1 to 1, where values closer to ±1 indicate stronger linear relationships. The heat maps highlight any coefficients exceeding an absolute value of 0.7, signaling high correlation, and allow for visual assessment of potential redundancy among variables.
Key insights:
No high correlations detected: All pairwise correlation coefficients in both training and test datasets remain below the 0.7 threshold, indicating the absence of strong linear dependencies among variables.
Consistent correlation structure across splits: The correlation patterns are stable between the training and test datasets, with the highest observed correlations (e.g., Balance and Geography_Germany at 0.42) remaining moderate and consistent.
Negative correlations limited in magnitude: The most negative correlations, such as between Geography_Spain and Geography_Germany (approximately -0.36 to -0.37), do not approach the high-correlation threshold and are consistent across both data splits.
The correlation analysis demonstrates that the numerical variables in the dataset exhibit low to moderate linear relationships, with no evidence of multicollinearity or redundancy based on the 0.7 threshold. The stability of the correlation structure between training and test datasets further supports the reliability of the variable set for modeling purposes.
Figures
2026-04-03 03:10:39,016 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document
The High Pearson Correlation test evaluates the linear relationships between feature pairs in the dataset to identify potential feature redundancy or multicollinearity. The results table presents the top ten strongest correlations for both the training and test datasets, indicating the Pearson correlation coefficient and whether each pair passes or fails the pre-set threshold of 0.3. Correlation coefficients are shown for pairs such as (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain), with Pass/Fail status determined by whether the absolute value exceeds the threshold.
Key insights:
Two feature pairs exceed correlation threshold: In both the training and test datasets, the pairs (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain) display absolute correlation coefficients above the 0.3 threshold, with values ranging from 0.3612 to 0.4167, resulting in a Fail status for these pairs.
All other feature pairs below threshold: The remaining feature pairs in both datasets have absolute correlation coefficients below 0.3, resulting in a Pass status.
Consistency across datasets: The same feature pairs with high correlations are observed in both the training and test datasets, with similar coefficient magnitudes, indicating stable correlation structure across data splits.
Highest observed correlation is moderate: The maximum absolute correlation coefficient observed is 0.4167, which, while above the threshold, does not approach levels typically associated with severe multicollinearity.
The test results indicate that most feature pairs exhibit low to moderate linear relationships, with only two pairs exceeding the specified correlation threshold in both datasets. The correlation structure is consistent between training and test data, and the highest observed correlations are moderate in magnitude. This suggests limited risk of feature redundancy or multicollinearity based on the tested threshold, with only isolated pairs warranting further consideration.
2026-04-03 03:10:53,258 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
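A pair-level check like this can be sketched with `DataFrame.corr`. The helper below mirrors the Pass/Fail logic against a 0.3 threshold on synthetic data; the function and frame are illustrative, not the library's code:

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df: pd.DataFrame, max_threshold: float = 0.3) -> pd.DataFrame:
    """List feature pairs by absolute Pearson correlation and flag those
    exceeding the threshold."""
    corr = df.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            coef = corr.iloc[i, j]
            pairs.append({
                "Columns": (cols[i], cols[j]),
                "Coefficient": round(coef, 4),
                "Pass/Fail": "Pass" if abs(coef) <= max_threshold else "Fail",
            })
    return pd.DataFrame(pairs).sort_values(
        "Coefficient", key=lambda s: s.abs(), ascending=False, ignore_index=True)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({"a": a, "b": 0.5 * a + rng.normal(size=300), "c": rng.normal(size=300)})
print(high_correlation_pairs(df))  # (a, b) should exceed the threshold
```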
validmind.model_validation.ModelMetadata
Model Metadata
The ModelMetadata test compares key metadata fields across models to assess consistency in architecture, framework, version, and programming language. The summary table presents side-by-side metadata for each model, including modeling technique, framework, framework version, and programming language. Both the log_model_champion and rf_model are included in the comparison, with all relevant metadata fields displayed for each.
Key insights:
Consistent modeling technique and framework: Both models use the SKlearnModel technique and the sklearn framework.
Identical framework versions: The framework version is 1.8.0 for both models.
Uniform programming language: Python is the programming language for both models.
The metadata comparison reveals complete alignment across all evaluated fields for the included models. No inconsistencies or missing metadata are observed, indicating a standardized approach to model development and documentation within this set.
Tables
| model | Modeling Technique | Modeling Framework | Framework Version | Programming Language |
|---|---|---|---|---|
| log_model_champion | SKlearnModel | sklearn | 1.8.0 | Python |
| rf_model | SKlearnModel | sklearn | 1.8.0 | Python |
2026-04-03 03:10:56,678 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document
The Model Parameters test provides a structured summary of all configuration parameters for the models used in this workflow. The results table lists each parameter name and its corresponding value for both the logistic regression and random forest models, enabling transparency and supporting reproducibility. This information allows for detailed auditing of model setup and facilitates comparison across model versions or retraining cycles.
Key insights:
Distinct parameterization for each model: The logistic regression model ("log_model_champion") and the random forest model ("rf_model") each display a full set of parameters, with no missing values in the extracted configuration.
Logistic regression uses L1 regularization: The penalty parameter is set to "l1" with the "liblinear" solver, and regularization strength (C) is set to 1, indicating a standard configuration for sparse solutions.
Random forest configured with 50 estimators: The random forest model uses 50 trees, "gini" criterion, and "sqrt" for max_features, with bootstrap sampling enabled and a fixed random_state of 42.
Default and explicit parameter values present: Several parameters, such as min_samples_leaf (1), min_samples_split (2), and ccp_alpha (0.0), are set to their default values, while others like random_state and n_estimators are explicitly specified.
The extracted parameter set provides a comprehensive and transparent record of model configuration for both logistic regression and random forest models. All key parameters are present and clearly defined, supporting reproducibility and facilitating future audits or comparisons. The parameter choices reflect standard practices for these model types, with explicit settings for core hyperparameters and defaults retained where appropriate.
Tables
| model | Parameter | Value |
|---|---|---|
| log_model_champion | C | 1 |
| log_model_champion | dual | False |
| log_model_champion | fit_intercept | True |
| log_model_champion | intercept_scaling | 1 |
| log_model_champion | max_iter | 100 |
| log_model_champion | penalty | l1 |
| log_model_champion | solver | liblinear |
| log_model_champion | tol | 0.0001 |
| log_model_champion | verbose | 0 |
| log_model_champion | warm_start | False |
| rf_model | bootstrap | True |
| rf_model | ccp_alpha | 0.0 |
| rf_model | criterion | gini |
| rf_model | max_features | sqrt |
| rf_model | min_impurity_decrease | 0.0 |
| rf_model | min_samples_leaf | 1 |
| rf_model | min_samples_split | 2 |
| rf_model | min_weight_fraction_leaf | 0.0 |
| rf_model | n_estimators | 50 |
| rf_model | oob_score | False |
| rf_model | random_state | 42 |
| rf_model | verbose | 0 |
| rf_model | warm_start | False |
2026-04-03 03:11:05,310 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
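For scikit-learn estimators, a parameter table like this comes straight from `get_params()`. A minimal sketch reconstructing the champion model's reported configuration (the variable name is illustrative):

```python
from sklearn.linear_model import LogisticRegression

# L1 penalty with the liblinear solver and C=1, matching the table above
log_model = LogisticRegression(penalty="l1", solver="liblinear", C=1)

# get_params() returns the full configuration dict an extraction test reads
params = log_model.get_params()
print(params["penalty"], params["solver"], params["C"])
```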
validmind.model_validation.sklearn.ROCCurve
ROC Curve
The ROC Curve test evaluates the binary classification performance of the model by plotting the Receiver Operating Characteristic (ROC) curve and calculating the Area Under the Curve (AUC) for both the training and test datasets. The resulting plots display the trade-off between the true positive rate and false positive rate at various thresholds, with the AUC quantifying the model's ability to distinguish between classes. The ROC curves for both datasets are compared against a baseline representing random classification (AUC = 0.5).
Key insights:
AUC indicates moderate discriminative ability: The AUC is 0.68 on the training dataset and 0.70 on the test dataset, reflecting moderate separation between positive and negative classes.
Consistent performance across datasets: The similarity of AUC values between training and test datasets suggests stable model behavior and limited overfitting.
ROC curves remain above random baseline: Both ROC curves are consistently above the random classifier line, indicating the model provides meaningful predictive power beyond chance.
The ROC Curve test results demonstrate that the model achieves moderate discriminative performance, with AUC values above the random baseline on both training and test datasets. The close alignment of AUC scores across datasets indicates stable generalization, and the ROC curves confirm the model's ability to distinguish between classes at various thresholds.
Figures
2026-04-03 03:11:09,219 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
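To independently verify AUC figures like these, the same metric can be recomputed directly with scikit-learn. A minimal sketch on synthetic data (the actual `train_dataset_final` and `test_dataset_final` were prepared in the earlier notebooks of this series and are stand-ins here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the datasets used in this series
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# roc_auc_score uses the positive-class probability, matching the AUC
# reported by the ROCCurve test
test_probs = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, test_probs)

# roc_curve returns the TPR/FPR trade-off that the test plots
fpr, tpr, thresholds = roc_curve(y_test, test_probs)
print(f"test AUC: {auc:.4f}")
```

Comparing your independently computed AUC against the logged value is a quick way to confirm you are replicating the champion model's results rather than just re-reading them.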
validmind.model_validation.sklearn.MinimumROCAUCScore
Minimum ROC AUC Score
The Minimum ROC AUC Score test evaluates whether the model's multiclass ROC AUC score meets or exceeds a specified threshold, providing an assessment of the model's ability to distinguish between classes. The results table presents ROC AUC scores for both the training and test datasets, alongside the minimum threshold and the corresponding pass/fail status. Both datasets are evaluated against a threshold of 0.5, with observed scores and outcomes reported for each.
Key insights:
ROC AUC scores exceed threshold: Both the training (0.6764) and test (0.705) datasets have ROC AUC scores above the minimum threshold of 0.5.
Consistent test outcomes across datasets: The test is marked as "Pass" for both the training and test datasets, indicating consistent model performance in distinguishing between classes.
The results indicate that the model demonstrates adequate discriminatory power on both the training and test datasets, as measured by the multiclass ROC AUC metric. The observed scores surpass the predefined threshold, and the test outcomes confirm that the model meets the minimum performance criterion for this metric across both evaluation datasets.
Parameters:
{
"min_threshold": 0.5
}
Tables
| Dataset | Score | Threshold | Pass/Fail |
|---------------------|--------|-----------|-----------|
| train_dataset_final | 0.6764 | 0.5 | Pass |
| test_dataset_final | 0.7050 | 0.5 | Pass |
2026-04-03 03:11:13,447 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
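The pass/fail logic in the table above is a simple threshold comparison. A minimal sketch reproducing it with the reported scores and the `min_threshold` parameter shown earlier:

```python
# Reproduce the MinimumROCAUCScore pass/fail logic for the reported scores
min_threshold = 0.5
scores = {"train_dataset_final": 0.6764, "test_dataset_final": 0.7050}

results = {
    name: ("Pass" if score >= min_threshold else "Fail")
    for name, score in scores.items()
}
print(results)
```

In practice you would raise `min_threshold` to your organization's own acceptance criterion; 0.5 only rules out a model no better than chance.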
In summary
In this final notebook, you learned how to:

Supplement ValidMind tests with your own custom tests
Include custom test results as additional evidence in your validation report
With our ValidMind for model validation series of notebooks, you learned how to validate a model end-to-end with the ValidMind Library by running through some common scenarios in a typical model validation setting:
Verifying the data quality steps performed by the model development team
Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
Setting up test inputs and a challenger model for comparative analysis
Running validation tests, analyzing results, and logging artifacts to ValidMind
Next steps
Work with your validation report
Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report. Continue to work on your validation report by:
Inserting additional test results: Click Link Evidence to Report under any section of 2. Validation in your validation report. (Learn more: Link evidence to reports)
Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)
Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage model findings)
Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit the draft to adhere to your organization's requirements. (Learn more: Work with content blocks)
Assessing compliance: Under the Guideline for any validation report section, click ASSESSMENT and select the compliance status from the drop-down menu. (Learn more: Provide compliance assessments)
Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaboration features to work seamlessly with the rest of your organization, including model developers. Propose suggested changes in the model documentation, work with versioned history, and use comments to discuss specific portions of the model documentation. (Learn more: Collaborate with others)
When your validation report is complete and ready for review, submit it for approval in the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough model validation history. (Learn more: Submit for approval)
Learn more
Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining model validation: