ValidMind for validation 2 — Start the validation process

Learn how to use ValidMind for your end-to-end validation process with our series of four introductory notebooks. In this second notebook, independently verify the data quality tests performed on the dataset used to train the champion.

You'll learn how to run relevant validation tests with ValidMind, log the results of those tests to the ValidMind Platform, and insert your logged test results as evidence into your validation report. You'll become familiar with the tests available in ValidMind, as well as how to run them. Running tests during validation is crucial to the effective challenge process, as we want to independently evaluate the evidence and assessments provided by the development team.

While running our tests in this notebook, we'll focus on:

Ensuring that data used for training and testing the champion is of appropriate data quality
Ensuring that the raw data has been preprocessed appropriately and that the resulting final datasets reflect this

For a full list of out-of-the-box tests and descriptions, use the interactive ValidMind test sandbox.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform: Validator Fundamentals

Prerequisites

To independently assess the quality of your datasets with this notebook, you'll first need to have:

Registered a model within the ValidMind Platform and granted yourself access to the model as a validator
Installed the ValidMind Library in your local environment, allowing you to access all its features

Need help with the above steps?

Refer to the first notebook in this series: 1 — Set up the ValidMind Library for validation

Setting up

Initialize the ValidMind Library

First, let's connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

On the left sidebar that appears for your model, select Getting Started and select Validation from the Document drop-down menu.
Click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="validation-report",
)

Note: you may need to restart the kernel to use updated packages.

2026-07-14 05:32:41,268 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report
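Before calling vm.init(), it can help to confirm that the credentials were actually loaded from the .env file. The following is a minimal sketch, not part of the ValidMind Library: the helper missing_credentials is hypothetical, and the variable names VM_API_HOST, VM_API_KEY, VM_API_SECRET, and VM_API_MODEL are assumptions that you should confirm against the snippet copied from the platform.

```python
import os

# Assumed variable names; confirm them against your own code snippet or .env file.
REQUIRED_VARS = ("VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_API_MODEL")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a hypothetical, partially filled environment
example_env = {"VM_API_HOST": "https://example.invalid/api", "VM_API_KEY": "key"}
print(missing_credentials(example_env))
```

Calling missing_credentials() with no argument checks the real process environment, so an empty list suggests the .env file was picked up.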

Load the sample dataset

Let's first import the public Bank Customer Churn Prediction dataset from Kaggle, which was used to develop the dummy champion.

We'll use this dataset to review steps that should have been conducted during the initial development and documentation of the champion to ensure that the model was built correctly. By independently performing steps taken by the development team, we can confirm whether the model was built using appropriate and properly processed data.

In the example below, note that:

The target column, Exited, has a value of 1 when a customer has churned and 0 otherwise.
The ValidMind Library provides a wrapper to automatically load the dataset as a pandas DataFrame object. A pandas DataFrame is a two-dimensional tabular data structure that makes use of rows and columns.

from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
raw_df.head()

Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
0	619	France	Female	42	2	0.00	1	1	1	101348.88	1
1	608	Spain	Female	41	1	83807.86	1	0	1	112542.58	0
2	502	France	Female	42	8	159660.80	3	1	0	113931.57	1
3	699	France	Female	39	1	0.00	2	0	0	93826.63	0
4	850	Spain	Female	43	2	125510.82	1	1	1	79084.10	0
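A quick sanity check a validator might run on a freshly loaded dataset is to confirm that the target column only contains the documented values. The sketch below uses plain Python on the five rows shown above, reduced to three columns; the helper unexpected_target_values is a hypothetical name, not a ValidMind function.

```python
# The first five rows shown above, reduced to three columns for brevity
sample_rows = [
    {"CreditScore": 619, "Geography": "France", "Exited": 1},
    {"CreditScore": 608, "Geography": "Spain", "Exited": 0},
    {"CreditScore": 502, "Geography": "France", "Exited": 1},
    {"CreditScore": 699, "Geography": "France", "Exited": 0},
    {"CreditScore": 850, "Geography": "Spain", "Exited": 0},
]


def unexpected_target_values(rows, target_column, allowed=(0, 1)):
    """Return the sorted target values that fall outside `allowed`."""
    return sorted({row[target_column] for row in rows} - set(allowed))


print(unexpected_target_values(sample_rows, "Exited"))
```

An empty list means every observed target value is either 0 or 1, as the dataset description states.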

Verifying data quality adjustments

Let's say that thanks to the documentation submitted by the development team (Learn more: ValidMind for development), we know that the sample dataset was first modified before being used to train the champion. After performing some data quality assessments on the raw dataset, it was determined that the dataset required rebalancing, and highly correlated features were also removed.

Identify qualitative tests

During validation, we use the same data processing logic and training procedure as the development team to confirm that the model's results can be reproduced independently. Let's start with some data quality assessments by running a few individual tests, just like the development team did.

Use the vm.tests.list_tests() function introduced in the first notebook in this series in combination with vm.tests.list_tags() and vm.tests.list_tasks() to find which prebuilt tests are relevant for data quality assessment:

tasks represent the kind of modeling task associated with a test. Here we'll focus on classification tasks.
tags are free-form descriptions providing more details about the test, for example, what category the test falls into. Here we'll focus on the data_quality tag.

# Get the list of available task types
sorted(vm.tests.list_tasks())

['classification',
 'clustering',
 'data_validation',
 'feature_extraction',
 'monitoring',
 'nlp',
 'regression',
 'residual_analysis',
 'text_classification',
 'text_generation',
 'text_qa',
 'text_summarization',
 'time_series_forecasting',
 'visualization']

# Get the list of available tags
sorted(vm.tests.list_tags())

['AUC',
 'analysis',
 'anomaly',
 'anomaly_detection',
 'bias_and_fairness',
 'binary_classification',
 'calibration',
 'categorical_data',
 'classification',
 'classification_metrics',
 'clustering',
 'correlation',
 'credit_risk',
 'data_analysis',
 'data_distribution',
 'data_quality',
 'data_validation',
 'descriptive_statistics',
 'dimensionality_reduction',
 'distribution',
 'embeddings',
 'feature_importance',
 'feature_selection',
 'few_shot',
 'forecasting',
 'frequency_analysis',
 'kmeans',
 'linear_regression',
 'llm',
 'logistic_regression',
 'metadata',
 'model_comparison',
 'model_diagnosis',
 'model_explainability',
 'model_interpretation',
 'model_performance',
 'model_predictions',
 'model_selection',
 'model_training',
 'model_validation',
 'multiclass_classification',
 'nlp',
 'normality',
 'numerical_data',
 'outlier',
 'outliers',
 'qualitative',
 'rag_performance',
 'ragas',
 'regression',
 'retrieval_performance',
 'scorecard',
 'seasonality',
 'senstivity_analysis',
 'sklearn',
 'stationarity',
 'statistical_test',
 'statistics',
 'statsmodels',
 'tabular_data',
 'text_data',
 'threshold_optimization',
 'time_series_data',
 'unit_root_test',
 'visualization',
 'zero_shot']

You can pass tags and tasks as parameters to the vm.tests.list_tests() function to filter the tests based on the tags and task types.

For example, to find tests related to tabular data quality for classification models, you can call list_tests() like this:

vm.tests.list_tests(task="classification", tags=["tabular_data", "data_quality"])

ID	Name	Description	Has Figure	Has Table	Required Inputs	Params	Tags	Tasks
validmind.data_validation.ClassImbalance	Class Imbalance	Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model....	True	True	['dataset']	{'min_percent_threshold': {'type': 'int', 'default': 10}}	['tabular_data', 'binary_classification', 'multiclass_classification', 'data_quality']	['classification']
validmind.data_validation.DescriptiveStatistics	Descriptive Statistics	Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's...	False	True	['dataset']	{}	['tabular_data', 'time_series_data', 'data_quality']	['classification', 'regression']
validmind.data_validation.Duplicates	Duplicates	Tests dataset for duplicate entries, ensuring model reliability via data quality verification....	False	True	['dataset']	{'min_threshold': {'type': '_empty', 'default': 1}}	['tabular_data', 'data_quality', 'text_data']	['classification', 'regression']
validmind.data_validation.HighCardinality	High Cardinality	Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting....	False	True	['dataset']	{'num_threshold': {'type': 'int', 'default': 100}, 'percent_threshold': {'type': 'float', 'default': 0.1}, 'threshold_type': {'type': 'str', 'default': 'percent'}}	['tabular_data', 'data_quality', 'categorical_data']	['classification', 'regression']
validmind.data_validation.HighPearsonCorrelation	High Pearson Correlation	Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity....	False	True	['dataset']	{'max_threshold': {'type': 'float', 'default': 0.3}, 'top_n_correlations': {'type': 'int', 'default': 10}, 'feature_columns': {'type': 'list', 'default': None}}	['tabular_data', 'data_quality', 'correlation']	['classification', 'regression']
validmind.data_validation.MissingValues	Missing Values	Evaluates dataset quality by ensuring missing value percentage across all features does not exceed a set threshold....	False	True	['dataset']	{'min_percentage_threshold': {'type': 'float', 'default': 1.0}}	['tabular_data', 'data_quality']	['classification', 'regression']
validmind.data_validation.MissingValuesBarPlot	Missing Values Bar Plot	Assesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on...	True	False	['dataset']	{'threshold': {'type': 'int', 'default': 80}, 'fig_height': {'type': 'int', 'default': 600}}	['tabular_data', 'data_quality', 'visualization']	['classification', 'regression']
validmind.data_validation.Skewness	Skewness	Evaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data...	False	True	['dataset']	{'max_threshold': {'type': '_empty', 'default': 1}}	['data_quality', 'tabular_data']	['classification', 'regression']
validmind.plots.BoxPlot	Box Plot	Generates customizable box plots for numerical features in a dataset with optional grouping using Plotly....	True	False	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'group_by': {'type': 'Optional', 'default': None}, 'width': {'type': 'int', 'default': 1800}, 'height': {'type': 'int', 'default': 1200}, 'colors': {'type': 'Optional', 'default': None}, 'show_outliers': {'type': 'bool', 'default': True}, 'title_prefix': {'type': 'str', 'default': 'Box Plot of'}}	['tabular_data', 'visualization', 'data_quality']	['classification', 'regression', 'clustering']
validmind.plots.HistogramPlot	Histogram Plot	Generates customizable histogram plots for numerical features in a dataset using Plotly....	True	False	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'bins': {'type': 'Union', 'default': 30}, 'color': {'type': 'str', 'default': 'steelblue'}, 'opacity': {'type': 'float', 'default': 0.7}, 'show_kde': {'type': 'bool', 'default': True}, 'normalize': {'type': 'bool', 'default': False}, 'log_scale': {'type': 'bool', 'default': False}, 'title_prefix': {'type': 'str', 'default': 'Histogram of'}, 'width': {'type': 'int', 'default': 1200}, 'height': {'type': 'int', 'default': 800}, 'n_cols': {'type': 'int', 'default': 2}, 'vertical_spacing': {'type': 'float', 'default': 0.15}, 'horizontal_spacing': {'type': 'float', 'default': 0.1}}	['tabular_data', 'visualization', 'data_quality']	['classification', 'regression', 'clustering']
validmind.stats.DescriptiveStats	Descriptive Stats	Provides comprehensive descriptive statistics for numerical features in a dataset....	False	True	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'include_advanced': {'type': 'bool', 'default': True}, 'confidence_level': {'type': 'float', 'default': 0.95}}	['tabular_data', 'statistics', 'data_quality']	['classification', 'regression', 'clustering']
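To make the filtering behavior concrete, here is a minimal sketch over a tiny hand-built catalog transcribed from three rows of the table above. It assumes a test must support the requested task and carry every requested tag, which is consistent with the rows returned above but is not taken from the library's source; filter_tests and CATALOG are hypothetical names.

```python
# A few entries transcribed from the table above (ID -> tags and tasks)
CATALOG = {
    "validmind.data_validation.ClassImbalance": {
        "tags": ["tabular_data", "binary_classification",
                 "multiclass_classification", "data_quality"],
        "tasks": ["classification"],
    },
    "validmind.data_validation.HighPearsonCorrelation": {
        "tags": ["tabular_data", "data_quality", "correlation"],
        "tasks": ["classification", "regression"],
    },
    "validmind.plots.BoxPlot": {
        "tags": ["tabular_data", "visualization", "data_quality"],
        "tasks": ["classification", "regression", "clustering"],
    },
}


def filter_tests(catalog, task=None, tags=None):
    """Keep test IDs that support `task` and carry every tag in `tags`."""
    selected = []
    for test_id, meta in catalog.items():
        if task is not None and task not in meta["tasks"]:
            continue
        if tags and not set(tags).issubset(meta["tags"]):
            continue
        selected.append(test_id)
    return selected


print(filter_tests(CATALOG, task="clustering", tags=["data_quality"]))
print(filter_tests(CATALOG, task="classification", tags=["tabular_data", "correlation"]))
```

In this toy catalog only the box plot supports clustering, and only the correlation test carries both the tabular_data and correlation tags.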

Want to learn more about navigating ValidMind tests?

Refer to our notebook outlining the utilities available for viewing and understanding available ValidMind tests: Explore tests

Initialize the ValidMind dataset

With the individual tests we want to run identified, the next step is to connect your data with a ValidMind Dataset object. This step is necessary whenever you want to connect a dataset to documentation and produce test results through ValidMind, but you only need to do it once per dataset.

Initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module. For this example, we'll pass in the following arguments:

dataset — The raw dataset that you want to provide as input to tests.
input_id — A unique identifier that allows tracking what inputs are used when running each individual test.
target_column — A required argument if tests require access to true values. This is the name of the target column in the dataset.

# vm_raw_dataset is now a VMDataset object that you can pass to any ValidMind test
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)

Run data quality tests

Now that we know how to initialize a ValidMind dataset object, we're ready to run some tests!

You run individual tests by calling the run_test function provided by the validmind.tests module. For the examples below, we'll pass in the following arguments:

test_id — The ID of the test to run, as seen in the ID column when you run list_tests.
params — A dictionary of parameters for the test. These will override any default_params set in the test definition.
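The override behavior described for params can be pictured as a dictionary merge in which user-supplied values win. This is only an illustration of the semantics, assuming a simple key-by-key override; it is not the library's implementation, and effective_params is a hypothetical helper.

```python
# Default taken from the Params column of the list_tests() output for ClassImbalance
default_params = {"min_percent_threshold": 10}


def effective_params(defaults, overrides=None):
    """Merge user-supplied params over the defaults without mutating either dict."""
    return {**defaults, **(overrides or {})}


print(effective_params(default_params))
print(effective_params(default_params, {"min_percent_threshold": 30}))
```

With no overrides the default of 10 applies; passing min_percent_threshold=30 replaces it for that run only.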

Run tabular data tests

The inputs expected by a test can also be found in the test definition — let's take validmind.data_validation.DescriptiveStatistics as an example.

Note that the output of the describe_test() function below shows that this test expects a dataset as input:

vm.tests.describe_test("validmind.data_validation.DescriptiveStatistics")

▶ Test: Descriptive Statistics ('validmind.data_validation.DescriptiveStatistics')

Descriptive Statistics

Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's dataset.

Purpose

The purpose of the Descriptive Statistics metric is to provide a comprehensive summary of both numerical and categorical data within a dataset. This involves statistics such as count, mean, standard deviation, minimum and maximum values for numerical data. For categorical data, it calculates the count, number of unique values, most common value and its frequency, and the proportion of the most frequent value relative to the total. The goal is to visualize the overall distribution of the variables in the dataset, aiding in understanding the model's behavior and predicting its performance.

Test Mechanism

The testing mechanism utilizes two in-built functions of pandas dataframes: describe() for numerical fields and value_counts() for categorical fields. The describe() function pulls out several summary statistics, while value_counts() accounts for unique values. The resulting data is formatted into two distinct tables, one for numerical and another for categorical variable summaries. These tables provide a clear summary of the main characteristics of the variables, which can be instrumental in assessing the model's performance.

Signs of High Risk

Skewed data or significant outliers can represent high risk. For numerical data, this may be reflected via a significant difference between the mean and median (50% percentile).
For categorical data, a lack of diversity (low count of unique values), or overdominance of a single category (high frequency of the top value) can indicate high risk.

Strengths

Provides a comprehensive summary of the dataset, shedding light on the distribution and characteristics of the variables under consideration.
It is a versatile and robust method, applicable to both numerical and categorical data.
Helps highlight crucial anomalies such as outliers, extreme skewness, or lack of diversity, which are vital in understanding model behavior during testing and validation.

Limitations

While this metric offers a high-level overview of the data, it may fail to detect subtle correlations or complex patterns.
Does not offer any insights on the relationship between variables.
Alone, descriptive statistics cannot be used to infer properties about future unseen data.
Should be used in conjunction with other statistical tests to provide a comprehensive understanding of the model's data.

Required Inputs: dataset

How to Run:

Code:

        
import validmind as vm

# inputs dictionary maps your inputs to the expected input names
# keys are the expected input names and values are the actual inputs
# values may be string input_ids or the actual VMDataset or VMModel objects
inputs = {
    "dataset": "my_vm_dataset"
}
params = {}

# to run and view the result of this test, run the following code:
result = vm.tests.run_test(
  "validmind.data_validation.DescriptiveStatistics", inputs=inputs, params=params
)

# To see the result of the test, ensure that you have called `vm.init()` and then run:
result.log()

Now, let's run a few tests to assess the quality of the dataset:

result2 = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_raw_dataset},
    params={"min_percent_threshold": 30},
)

❌ Class Imbalance

The Class Imbalance test evaluates the distribution of target classes in the dataset by measuring the percentage share of each class against a minimum threshold. For the Exited target, the results table and accompanying bar chart show two classes with materially different representation levels. Class 0 accounts for 79.80% of rows and is marked as passing, while class 1 accounts for 20.20% of rows and is marked as failing under the configured minimum percentage threshold of 30%.

Key insights:

Minority class falls below threshold: Class 1 represents 20.20% of observations, which is below the configured 30% minimum threshold and is therefore flagged as failing.
Majority class dominates the distribution: Class 0 accounts for 79.80% of rows, making it the substantially more represented class in the target distribution.
Imbalance is clearly visible across classes: The gap between the two class proportions is 59.60 percentage points, as reflected consistently in both the tabular output and the bar chart.

The test results indicate a materially uneven target distribution for Exited, with one class comprising nearly four-fifths of the dataset and the other comprising roughly one-fifth. Under the applied 30% threshold, only class 0 passes and class 1 fails. Collectively, the results document that the target distribution is imbalanced according to the configured test criterion.

Parameters:

{
  "min_percent_threshold": 30
}

Tables

Exited Class Imbalance

Exited	Percentage of Rows (%)	Pass/Fail
0	79.80%	Pass
1	20.20%	Fail

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:37a3

The output above shows that the validmind.data_validation.ClassImbalance test did not pass according to the value we set for min_percent_threshold — great, this matches what was reported by the development team.
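To see how such a pass/fail decision can arise, the following sketch recomputes class shares with the standard library. It is not ValidMind's implementation: it assumes a class passes when its share is strictly above the threshold, and the 1,000-row label list is a hypothetical stand-in chosen only to reproduce the 79.80% and 20.20% split reported above.

```python
from collections import Counter


def class_imbalance(labels, min_percent_threshold=10):
    """Return {class: (percentage_of_rows, passed)} for each observed class."""
    counts = Counter(labels)
    total = sum(counts.values())
    report = {}
    for cls, count in sorted(counts.items()):
        percent = 100 * count / total
        # Assumption: a class passes when its share is strictly above the threshold
        report[cls] = (round(percent, 2), percent > min_percent_threshold)
    return report


# Hypothetical 1,000-row target chosen to reproduce the reported split
labels = [0] * 798 + [1] * 202
print(class_imbalance(labels, min_percent_threshold=30))
```

With a threshold of 30, class 0 passes and class 1 fails, matching the outcome in the table above.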

To address this issue, we'll re-run the test on some processed data. In this case let's apply a very simple rebalancing technique to the dataset:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Shuffle the rows; sample() returns a new DataFrame, so raw_df is unchanged

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)

With this new balanced dataset, you can re-run the individual test to see if it now passes the class imbalance test requirement.

As this is technically a different dataset, remember to first initialize a new ValidMind Dataset object to pass in as input as required by run_test():

# Register new data and now 'balanced_raw_dataset' is the new dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

# Pass the initialized `vm_balanced_raw_dataset` object as input into the test run
result = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_balanced_raw_dataset},
    params={"min_percent_threshold": 30},
)

✅ Class Imbalance

The Class Imbalance test evaluates the distribution of target classes in the dataset by measuring each class share against a minimum percentage threshold. In this run, the target variable Exited contains two classes, 0 and 1, and the results table reports the percentage of rows for each class alongside a pass/fail outcome. Both classes account for 50.00% of observations, and both are marked as passing under the configured minimum threshold of 30%.

Key insights:

Balanced two-class distribution: The Exited target is evenly split between classes 0 and 1, with each representing 50.00% of the dataset.
All classes exceed threshold: Both observed classes pass the test against the configured minimum percentage threshold of 30%.
No underrepresented target class: The results show no class with a share below the threshold, and the accompanying bar chart reflects equal class proportions.

The test results indicate a fully balanced target distribution for Exited within the evaluated dataset. Each class comprises half of the observations and exceeds the 30% minimum threshold by a wide margin. Collectively, the table and chart show no observed class concentration or underrepresentation in the target variable.

Parameters:

{
  "min_percent_threshold": 30
}

Tables

Exited Class Imbalance

Exited	Percentage of Rows (%)	Pass/Fail
0	50.00%	Pass
1	50.00%	Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:a6be
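The rebalancing applied before this run draws its random samples without a fixed seed in the first two sample() calls, so the exact rows kept can differ between runs. For independent validation it is often useful to make that step reproducible. Below is a minimal, standard-library sketch of seeded undersampling on hypothetical toy rows; undersample is an illustrative helper, not the development team's method.

```python
import random


def undersample(rows, target_column, seed=42):
    """Downsample every class to the size of the smallest class, then shuffle."""
    rng = random.Random(seed)  # fixed seed so the same rows are drawn on every run
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_column], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 8 retained customers and 2 churned customers
rows = [{"id": i, "Exited": 0} for i in range(8)]
rows += [{"id": i, "Exited": 1} for i in range(8, 10)]
balanced = undersample(rows, "Exited")
print(len(balanced), sum(row["Exited"] for row in balanced))
```

Because the generator is seeded, calling undersample again with the same inputs returns the same rows in the same order.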

Remove highly correlated features

Next, let's also remove highly correlated features from our dataset as outlined by the development team. Removing highly correlated features helps make the model simpler, more stable, and easier to understand.

You can reuse the output from a ValidMind test in later steps. In the example below, we retrieve the list of features with the highest correlation coefficients and use it to reduce the final list of features for modeling.

First, we'll run validmind.data_validation.HighPearsonCorrelation on the balanced_raw_dataset we initialized previously, unchanged, so that we can compare the result with later runs:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify highly correlated variable pairs that may indicate redundancy or multicollinearity. The result table reports the top 10 feature pairs ranked by Pearson correlation coefficient, along with Pass/Fail status based on the configured absolute-correlation threshold of 0.3. Observed coefficients range from -0.1796 to 0.3465, with one pair exceeding the threshold and the remaining reported pairs falling within the passing range.

Key insights:

One pair exceeds threshold: The pair (Age, Exited) has a Pearson correlation coefficient of 0.3465, which is the only reported result with a Fail status under the 0.3 threshold.
Remaining reported correlations are modest: All other listed feature pairs have absolute correlation values below 0.18, including (Balance, NumOfProducts) at -0.1796 and (IsActiveMember, Exited) at -0.1697.
Reported relationships are mostly weak: Aside from (Age, Exited), the reported coefficients cluster close to zero, such as (NumOfProducts, Exited) at -0.0660, (Tenure, IsActiveMember) at -0.0366, and (Tenure, EstimatedSalary) at 0.0338.
Both positive and negative associations appear: The reported set includes positive coefficients, such as (Balance, Exited) at 0.1513, and negative coefficients, such as (Balance, NumOfProducts) at -0.1796, indicating mixed linear relationship directions across feature pairs.

The reported correlation structure is limited to a single pair above the configured threshold, with (Age, Exited) showing the strongest observed linear relationship at 0.3465. All other displayed feature pairs remain below the threshold and have relatively small magnitudes, with absolute values under 0.18. Overall, the top reported correlations indicate that stronger linear association is concentrated in one pair, while the remaining listed relationships are comparatively weak.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Age, Exited)	0.3465	Fail
(Balance, NumOfProducts)	-0.1796	Pass
(IsActiveMember, Exited)	-0.1697	Pass
(Balance, Exited)	0.1513	Pass
(NumOfProducts, Exited)	-0.0660	Pass
(HasCrCard, IsActiveMember)	-0.0527	Pass
(NumOfProducts, IsActiveMember)	0.0468	Pass
(Age, NumOfProducts)	-0.0463	Pass
(Tenure, IsActiveMember)	-0.0366	Pass
(Tenure, EstimatedSalary)	0.0338	Pass

The output above shows that the test did not pass according to the value we set for max_threshold — as reported and expected.
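For reference, the Pearson coefficient behind each row can be computed by hand. The sketch below is a plain-Python illustration, not the test's implementation; it assumes a pair passes when the absolute coefficient does not exceed max_threshold, and the toy columns are hypothetical rather than taken from the churn dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def passes(coefficient, max_threshold=0.3):
    """Assumption: a pair passes when |r| does not exceed the threshold."""
    return abs(coefficient) <= max_threshold


# Hypothetical toy columns with a perfectly linear relationship
r = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(r, 4), passes(r))

# Coefficients reported in the table above
print(passes(0.3465), passes(-0.1796))
```

Note that the sign is ignored when comparing against the threshold, which is why -0.1796 passes while 0.3465 fails.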

corr_result is an object of type TestResult. We can inspect the result object to see what the test has produced:

print(type(corr_result))
print("Result ID: ", corr_result.result_id)
print("Params: ", corr_result.params)
print("Passed: ", corr_result.passed)
print("Tables: ", corr_result.tables)

<class 'validmind.vm_models.result.result.TestResult'>
Result ID:  validmind.data_validation.HighPearsonCorrelation
Params:  {'max_threshold': 0.3}
Passed:  False
Tables:  [ResultTable]

Let's remove the highly correlated features and create a new VM dataset object.

We'll begin by checking out the table in the result and extracting a list of features that failed the test:

# Extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df

	Columns	Coefficient	Pass/Fail
0	(Age, Exited)	0.3465	Fail
1	(Balance, NumOfProducts)	-0.1796	Pass
2	(IsActiveMember, Exited)	-0.1697	Pass
3	(Balance, Exited)	0.1513	Pass
4	(NumOfProducts, Exited)	-0.0660	Pass
5	(HasCrCard, IsActiveMember)	-0.0527	Pass
6	(NumOfProducts, IsActiveMember)	0.0468	Pass
7	(Age, NumOfProducts)	-0.0463	Pass
8	(Tenure, IsActiveMember)	-0.0366	Pass
9	(Tenure, EstimatedSalary)	0.0338	Pass

# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features

['(Age, Exited)']

Next, extract the feature names from the list of strings (for example, (Age, Exited) becomes Age):

high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features

['Age']
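The one-liner above keeps the first name in each pair, which works here because the target column is listed second. A slightly more defensive variant, sketched below under the assumption that pairs are formatted as "(A, B)" strings, skips the target column so it can never end up in the drop list. features_to_drop is a hypothetical helper, not part of the ValidMind Library.

```python
def features_to_drop(pairs, target_column):
    """Pick one non-target feature from each '(A, B)' pair string."""
    selected = []
    for pair in pairs:
        names = [name.strip() for name in pair.strip("()").split(",")]
        candidates = [name for name in names if name != target_column]
        # Take the first remaining feature, and skip duplicates across pairs
        if candidates and candidates[0] not in selected:
            selected.append(candidates[0])
    return selected


print(features_to_drop(["(Age, Exited)"], target_column="Exited"))
# Hypothetical ordering in which the target is listed first
print(features_to_drop(["(Exited, Age)"], target_column="Exited"))
```

Both calls select Age, regardless of where the target appears in the pair.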

Now, it's time to re-initialize the dataset with the highly correlated features removed.

Note the use of a different input_id. This allows tracking the inputs used when running each individual test.

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)

Re-running the test with the reduced feature set should now pass:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify highly correlated variable pairs that may indicate redundancy or multicollinearity. The result table reports the top 10 strongest feature-pair correlations after removing duplicate and self-correlations, using an absolute correlation threshold of 0.3 for pass/fail classification. In this output, all reported coefficients fall within a narrow range from -0.1796 to 0.1513, and each pair is marked as Pass.

Key insights:

No correlations exceed threshold: All 10 reported feature pairs pass the test against the 0.3 absolute correlation threshold. The largest absolute coefficient observed is 0.1796 for (Balance, NumOfProducts), which remains below the configured limit.
Strongest relationships are weak: The highest-magnitude correlations in the reported output are (Balance, NumOfProducts) at -0.1796, (IsActiveMember, Exited) at -0.1697, and (Balance, Exited) at 0.1513. These values indicate limited linear association among the strongest pairs shown.
Reported correlations are concentrated near zero: The remaining coefficients range from -0.0660 to 0.0468 among pairs including NumOfProducts, HasCrCard, Tenure, CreditScore, EstimatedSalary, IsActiveMember, and Exited. This concentration around zero indicates that the listed top correlations are modest in magnitude.

The reported Pearson correlation results show that none of the strongest observed feature-pair relationships exceed the specified threshold of 0.3. The largest absolute correlations are below 0.18, while most remaining values are materially closer to zero. Collectively, the output indicates a weak linear correlation structure among the top-ranked feature pairs returned by this test.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Balance, NumOfProducts)	-0.1796	Pass
(IsActiveMember, Exited)	-0.1697	Pass
(Balance, Exited)	0.1513	Pass
(NumOfProducts, Exited)	-0.0660	Pass
(HasCrCard, IsActiveMember)	-0.0527	Pass
(NumOfProducts, IsActiveMember)	0.0468	Pass
(Tenure, IsActiveMember)	-0.0366	Pass
(Tenure, EstimatedSalary)	0.0338	Pass
(CreditScore, IsActiveMember)	0.0324	Pass
(CreditScore, EstimatedSalary)	-0.0276	Pass
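The table above is ordered by the absolute value of each coefficient. The following sketch reproduces that ranking for a handful of the reported pairs; it assumes a pair fails only when its absolute coefficient exceeds max_threshold, and rank_correlations is an illustrative helper rather than the test's actual code.

```python
# A subset of the coefficients reported above, deliberately listed out of order
coefficients = {
    ("Tenure", "EstimatedSalary"): 0.0338,
    ("Balance", "Exited"): 0.1513,
    ("Balance", "NumOfProducts"): -0.1796,
    ("NumOfProducts", "Exited"): -0.0660,
    ("IsActiveMember", "Exited"): -0.1697,
}


def rank_correlations(coefficients, max_threshold=0.3, top_n=10):
    """Sort pairs by absolute coefficient and label each one Pass or Fail."""
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
    return [
        (pair, value, "Fail" if abs(value) > max_threshold else "Pass")
        for pair, value in ranked[:top_n]
    ]


for pair, value, status in rank_correlations(coefficients):
    print(pair, value, status)
```

The strongest pair, (Balance, NumOfProducts), comes first even though its coefficient is negative, and every pair is labeled Pass under the 0.3 threshold.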

You can also plot the correlation matrix to visualize the new correlation between features:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.PearsonCorrelationMatrix",
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

Pearson Correlation Matrix

The PearsonCorrelationMatrix test evaluates linear dependency among numerical variables using pairwise Pearson correlation coefficients. The result is presented as a symmetric heat map covering CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, with coefficients ranging from -1 to 1 and 1.0 on the diagonal. Most off-diagonal correlations are clustered near zero, indicating weak linear relationships across the variables shown. The largest visible non-diagonal magnitudes are -0.18 between Balance and NumOfProducts, -0.17 between IsActiveMember and Exited, and 0.15 between Balance and Exited.

Key insights:

Correlations are uniformly low: All visible off-diagonal coefficients remain well below the 0.7 threshold referenced by the test methodology. The matrix does not show any pair of variables with strong positive or negative linear dependence.
Largest negative relationship is Balance–NumOfProducts: The most negative non-diagonal coefficient in the matrix is -0.18 between Balance and NumOfProducts. This indicates a weak inverse linear relationship relative to the other feature pairs.
Exited has limited linear association with inputs: Correlations between Exited and the other variables are all small in magnitude, with the largest being -0.17 for IsActiveMember and 0.15 for Balance. The remaining relationships with Exited are near zero, including CreditScore (-0.01), Tenure (-0.02), NumOfProducts (-0.07), HasCrCard (-0.01), and EstimatedSalary (-0.0).
Several feature pairs are effectively uncorrelated: Many coefficients are approximately zero, including Tenure–Balance (0.0), NumOfProducts–HasCrCard (-0.0), IsActiveMember–EstimatedSalary (-0.0), and EstimatedSalary–Exited (-0.0). This reflects minimal detectable linear association for these pairs in the dataset.

The correlation structure is sparse and weak across the numerical variables included in the test. No high-correlation pairs are present, and the observed relationships with the target variable Exited are also limited in magnitude. Overall, the heat map indicates low linear redundancy among the variables shown.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:cf01

Documenting test results

Now that we've done some analysis on two different datasets, we can use ValidMind to easily document why certain things were done to our raw data with testing to support it. Every test result returned by the run_test() function has a .log() method that can be used to send the test results to the ValidMind Platform.

When logging validation test results to the platform, you'll need to manually add those results to the desired section of the validation report. To demonstrate how to add test results to your validation report, we'll log our data quality tests and insert the results via the ValidMind Platform.

Configure and run comparison tests

Below, we'll perform comparison tests between the original raw dataset (raw_dataset) and the final preprocessed dataset (raw_dataset_preprocessed), again logging the results to the ValidMind Platform.

We can specify all the tests we'd like to run in a dictionary called test_config, and we'll pass in the following arguments for each test:

params: Individual test parameters.
input_grid: Individual test inputs to compare. In this case, we'll input our two datasets for comparison.

Note here that the input_grid expects the input_id of the dataset as the value rather than the variable name we specified:

# Individual test config with inputs specified
test_config = {
    "validmind.data_validation.ClassImbalance": {
        "input_grid": {"dataset": ["raw_dataset", "raw_dataset_preprocessed"]},
        "params": {"min_percent_threshold": 30}
    },
    "validmind.data_validation.HighPearsonCorrelation": {
        "input_grid": {"dataset": ["raw_dataset", "raw_dataset_preprocessed"]},
        "params": {"max_threshold": 0.3}
    },
}
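
Conceptually, an input grid describes one test run per combination of the listed inputs. The sketch below expands a grid with itertools.product to show the combinations implied by the config above. Here expand_input_grid is a hypothetical helper for illustration, and how ValidMind expands the grid internally is an assumption:

```python
from itertools import product


def expand_input_grid(input_grid):
    """Return one {input_name: input_id} mapping per combination in the grid."""
    names = list(input_grid)
    value_lists = (input_grid[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


grid = {"dataset": ["raw_dataset", "raw_dataset_preprocessed"]}
for combination in expand_input_grid(grid):
    print(combination)
```

With a single input name and two input IDs, the grid yields two combinations, which matches the two datasets reported side by side in the comparison results.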

Then batch run and log our tests in test_config:

for test_id, config in test_config.items():
    print(test_id)
    try:
        # Use input_grid for comparison tests; otherwise fall back to regular inputs
        if "input_grid" in config:
            kwargs = {"input_grid": config["input_grid"]}
        else:
            kwargs = {"inputs": config["inputs"]}
        # Only pass params when the test config defines them
        if "params" in config:
            kwargs["params"] = config["params"]
        vm.tests.run_test(test_id, **kwargs).log()
    except Exception as e:
        print(f"Error running test {test_id}: {e}")

validmind.data_validation.ClassImbalance

❌ Class Imbalance

The Class Imbalance test evaluates the distribution of target classes in the dataset by measuring each class’s share of total records against the configured minimum percentage threshold of 30%. The results are reported separately for raw_dataset and raw_dataset_preprocessed for the target Exited. In raw_dataset, class Exited=0 represents 79.80% of rows and class Exited=1 represents 20.20%, while in raw_dataset_preprocessed both classes each represent 50.00% of rows. The pass/fail status shown in the results reflects whether each class meets the 30% threshold.

Key insights:

Raw dataset is imbalanced: In raw_dataset, Exited=0 accounts for 79.80% of observations and passes the threshold, while Exited=1 accounts for 20.20% and fails the 30% minimum.
Preprocessed dataset is fully balanced: In raw_dataset_preprocessed, both Exited=0 and Exited=1 represent 50.00% of rows, and both classes pass the threshold.
Class distribution changed materially after preprocessing: The class shares move from a 79.80% / 20.20% split in the raw data to an even 50.00% / 50.00% split in the preprocessed data.

The results show that class imbalance is present in the original dataset under the applied 30% threshold, driven by the lower representation of Exited=1. In contrast, the preprocessed dataset meets the threshold for both classes and exhibits an even class distribution. Taken together, the test indicates that the class composition differs substantially between the raw and preprocessed datasets.

Parameters:

{
  "min_percent_threshold": 30
}

Tables

dataset	Exited	Percentage of Rows (%)	Pass/Fail
raw_dataset	0	79.80%	Pass
raw_dataset	1	20.20%	Fail
raw_dataset_preprocessed	0	50.00%	Pass
raw_dataset_preprocessed	1	50.00%	Pass
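
To make the pass/fail logic in the table above concrete, here is a minimal plain-Python sketch that computes each class's share of rows and compares it with min_percent_threshold. The toy target values and the class_shares helper are hypothetical, and treating a share exactly equal to the threshold as a pass is an assumption rather than confirmed ValidMind behavior:

```python
from collections import Counter


def class_shares(labels, min_percent_threshold=30):
    """Return (class, percentage, Pass/Fail) for each class in labels."""
    counts = Counter(labels)
    total = len(labels)
    rows = []
    for cls in sorted(counts):
        pct = 100 * counts[cls] / total
        status = "Pass" if pct >= min_percent_threshold else "Fail"
        rows.append((cls, pct, status))
    return rows


# Toy target column (hypothetical): 8 retained customers, 2 churned
toy_target = [0] * 8 + [1] * 2
for cls, pct, status in class_shares(toy_target):
    print(f"Exited={cls}: {pct:.2f}% {status}")
```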

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:46f0

ValidMind Figure validmind.data_validation.ClassImbalance:d368

2026-07-14 05:33:28,691 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance does not exist in model's document

validmind.data_validation.HighPearsonCorrelation

❌ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships between features to identify potentially redundant variables or multicollinearity. The results report the top correlations for both raw_dataset and raw_dataset_preprocessed, using an absolute correlation threshold of 0.3 to assign Pass or Fail status. In the raw dataset, one feature pair exceeds the threshold and is marked Fail, while all other reported correlations pass. In the preprocessed dataset, all reported feature pairs remain below the threshold.

Key insights:

One failed pair in raw data: The raw_dataset contains a single failed correlation, (Balance, NumOfProducts) with coefficient -0.3045, which marginally exceeds the 0.3 threshold in absolute value.
Preprocessing reduced the strongest correlation: The correlation between Balance and NumOfProducts declines from -0.3045 in raw_dataset to -0.1796 in raw_dataset_preprocessed, changing from Fail to Pass.
Age and Exited show notable association: In the raw dataset, (Age, Exited) has a coefficient of 0.281, making it the strongest passing correlation reported and the closest observed value to the threshold without exceeding it.
Remaining correlations are weak: Aside from the two largest raw correlations, all other reported coefficients in both datasets fall within approximately -0.17 to 0.15, indicating limited linear association among the listed feature pairs.

The results show a limited concentration of higher linear correlations, with only one pair in the raw dataset exceeding the configured threshold and no reported failures after preprocessing. The most material change is the reduction in correlation between Balance and NumOfProducts, while the rest of the reported feature relationships remain below the threshold and generally small in magnitude. Overall, the reported correlation structure is notably weaker in the preprocessed dataset than in the raw dataset.

Parameters:

{
  "max_threshold": 0.3
}

Tables

dataset	Columns	Coefficient	Pass/Fail
raw_dataset	(Balance, NumOfProducts)	-0.3045	Fail
raw_dataset	(Age, Exited)	0.2810	Pass
raw_dataset	(IsActiveMember, Exited)	-0.1515	Pass
raw_dataset	(Balance, Exited)	0.1174	Pass
raw_dataset	(Age, IsActiveMember)	0.0873	Pass
raw_dataset	(NumOfProducts, Exited)	-0.0523	Pass
raw_dataset	(Age, NumOfProducts)	-0.0306	Pass
raw_dataset	(CreditScore, IsActiveMember)	0.0306	Pass
raw_dataset	(Tenure, IsActiveMember)	-0.0293	Pass
raw_dataset	(Age, Balance)	0.0290	Pass
raw_dataset_preprocessed	(Balance, NumOfProducts)	-0.1796	Pass
raw_dataset_preprocessed	(IsActiveMember, Exited)	-0.1697	Pass
raw_dataset_preprocessed	(Balance, Exited)	0.1513	Pass
raw_dataset_preprocessed	(NumOfProducts, Exited)	-0.0660	Pass
raw_dataset_preprocessed	(HasCrCard, IsActiveMember)	-0.0527	Pass
raw_dataset_preprocessed	(NumOfProducts, IsActiveMember)	0.0468	Pass
raw_dataset_preprocessed	(Tenure, IsActiveMember)	-0.0366	Pass
raw_dataset_preprocessed	(Tenure, EstimatedSalary)	0.0338	Pass
raw_dataset_preprocessed	(CreditScore, IsActiveMember)	0.0324	Pass
raw_dataset_preprocessed	(CreditScore, EstimatedSalary)	-0.0276	Pass

2026-07-14 05:33:37,034 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation does not exist in model's document

Note the output returned indicating that a test-driven block doesn't currently exist in your documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.

Log tests with unique identifiers

Next, we'll use the previously initialized vm_balanced_raw_dataset (that still has a highly correlated Age column) as input to run an individual test, then log the result to the ValidMind Platform.

When running individual tests, you can use a custom result_id to tag the individual result with a unique identifier:

This result_id can be appended to test_id with a : separator.
The balanced_raw_dataset result identifier will correspond to the balanced_raw_dataset input, the dataset that still has the Age column.

result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)
result.log()

❌ High Pearson Correlation Balanced Raw Dataset

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify potentially redundant or highly collinear variable pairs. The results table reports the top 10 strongest correlations in the balanced raw dataset, showing each feature pair, its Pearson correlation coefficient, and whether it exceeds the configured absolute threshold of 0.3. Observed coefficients range from -0.1796 to 0.3465, with one pair classified as Fail and the remaining pairs classified as Pass.

Key insights:

One pair exceeds threshold: The pair (Age, Exited) has a Pearson correlation coefficient of 0.3465, which is above the configured threshold of 0.3 and is the only relationship marked as Fail.
All other listed correlations are modest: The remaining nine reported pairs have absolute correlation values below 0.18, with the largest among them being (Balance, NumOfProducts) at -0.1796.
Top relationships are mostly weakly negative or weakly positive: Among the reported top correlations, negative coefficients include (Balance, NumOfProducts) at -0.1796 and (IsActiveMember, Exited) at -0.1697, while positive coefficients beyond the leading pair include (Balance, Exited) at 0.1513.
Lower-ranked correlations are near zero: Several reported pairs have coefficients with small magnitudes, including (HasCrCard, IsActiveMember) at -0.0527, (NumOfProducts, IsActiveMember) at 0.0468, and (Tenure, EstimatedSalary) at 0.0338.

The reported correlation structure is concentrated in a single above-threshold relationship, with (Age, Exited) standing out as the only listed pair exceeding the 0.3 cutoff. All other reported feature pairs remain below the threshold and have relatively small magnitudes, indicating limited linear association among the other top-ranked relationships shown in the result set.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Age, Exited)	0.3465	Fail
(Balance, NumOfProducts)	-0.1796	Pass
(IsActiveMember, Exited)	-0.1697	Pass
(Balance, Exited)	0.1513	Pass
(NumOfProducts, Exited)	-0.0660	Pass
(HasCrCard, IsActiveMember)	-0.0527	Pass
(NumOfProducts, IsActiveMember)	0.0468	Pass
(Age, NumOfProducts)	-0.0463	Pass
(Tenure, IsActiveMember)	-0.0366	Pass
(Tenure, EstimatedSalary)	0.0338	Pass

2026-07-14 05:33:41,549 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset does not exist in model's document
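
The naming convention used above, a test ID followed by a : separator and a custom result_id, can be illustrated with a small string-handling sketch. The split_test_id helper is hypothetical and only demonstrates the convention; it is not part of the ValidMind Library:

```python
def split_test_id(full_id):
    """Split 'test_id:result_id' into (test_id, result_id); result_id is None if absent."""
    test_id, sep, result_id = full_id.partition(":")
    return test_id, (result_id if sep else None)


print(split_test_id("validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset"))
print(split_test_id("validmind.data_validation.ClassImbalance"))
```

Giving each logged result its own identifier is what lets several runs of the same test, on different inputs, sit side by side as separate pieces of evidence.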

Add test results to reporting

With some test results logged, let's head to the model we connected to at the beginning of this notebook and learn how to insert a test result into our validation report. (Learn more: Assess compliance)

While the example below focuses on a specific test result, you can follow the same general procedure for your other results:

From the Inventory in the ValidMind Platform, go to the model you connected to earlier.
In the left sidebar that appears for your model, click Validation under Documents.
Click on 2.2.1. Data Quality to expand that section.
Under the Class Imbalance Assessment guideline, click Evidence to expand the evidence panel.
Click Link Evidence, then select Validator Evidence.
Select the Class Imbalance test results we logged: ValidMind Data Validation Class Imbalance
Click Update Linked Evidence to add the test results to the validation report.
Confirm that the Class Imbalance test results have been correctly inserted into section 2.2.1. Data Quality of the report.
- Note that these test results are flagged as Requires Attention because they include comparative results from our initial raw dataset.
- Click See evidence details to review the LLM-generated description that summarizes the test results and confirms that our final preprocessed dataset actually passes our test.

Here in this text editor, you can make qualitative edits to the draft that ValidMind generated to finalize the test results.

Learn more: Work with content blocks

Preparing the preprocessed dataset

Split the preprocessed dataset

With our raw dataset rebalanced and highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing.

To start, let's grab the first few rows from the balanced_raw_no_age_df dataset we initialized earlier:

balanced_raw_no_age_df.head()

	CreditScore	Geography	Gender	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
1922	653	Germany	Female	2	158266.42	3	1	1	199357.24	0
4584	623	Spain	Male	1	83325.77	1	0	1	80828.78	0
1955	763	Spain	Female	10	95153.77	1	0	1	81310.10	0
1513	618	France	Male	7	0.00	1	1	1	142400.27	1
6996	727	France	Male	0	128213.96	2	1	1	188729.08	1

Before training the model, we need to encode the categorical features in the dataset:

Use the get_dummies function from pandas to one-hot encode the categorical features, dropping the first category of each feature (drop_first=True) to avoid redundant columns.
The categorical features in the dataset are Geography and Gender.

balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()

	CreditScore	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited	Geography_Germany	Geography_Spain	Gender_Male
1922	653	2	158266.42	3	1	1	199357.24	0	True	False	False
4584	623	1	83325.77	1	0	1	80828.78	0	False	True	True
1955	763	10	95153.77	1	0	1	81310.10	0	False	True	False
1513	618	7	0.00	1	1	1	142400.27	1	False	False	True
6996	727	0	128213.96	2	1	1	188729.08	1	False	False	True
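
The effect of drop_first=True is easier to see on a small example. The sketch below applies the same pd.get_dummies call to a toy frame with made-up rows (the values are hypothetical, chosen only so that every category appears once):

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same two categorical columns
toy_df = pd.DataFrame(
    {
        "CreditScore": [653, 623, 618],
        "Geography": ["Germany", "Spain", "France"],
        "Gender": ["Female", "Male", "Male"],
    }
)

encoded = pd.get_dummies(toy_df, columns=["Geography", "Gender"], drop_first=True)

# The first category of each column (France, Female) gets no column of its own;
# it is represented by all of that feature's dummy columns being False.
print(list(encoded.columns))
print(encoded)
```

This is why the encoded dataset above has Geography_Germany and Geography_Spain but no Geography_France, and Gender_Male but no Gender_Female.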

Splitting our dataset into training and testing is essential for proper validation testing, as this helps assess how well the model generalizes to unseen data:

We start by dividing our balanced_raw_no_age_df dataset into training and test subsets using train_test_split, with 80% of the data allocated to training (train_df) and 20% to testing (test_df).
From each subset, we separate the features (all columns except "Exited") into X_train and X_test, and the target column ("Exited") into y_train and y_test.

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
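
The split above draws a different random sample each time it runs. If a repeatable split is needed, for example so that logged results can be reproduced later, one option is to pass random_state, and optionally stratify to preserve the class ratio in both subsets. The sketch below shows this on a toy frame. The data and the toy_train and toy_test names are hypothetical, and these arguments are a suggestion rather than something the notebook itself uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical): 20 rows with a perfectly balanced target
toy_df = pd.DataFrame({"feature": range(20), "Exited": [0, 1] * 10})

toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.20,
    random_state=42,            # fixed seed so the split is repeatable
    stratify=toy_df["Exited"],  # keep the class ratio the same in both subsets
)

print(len(toy_train), len(toy_test))
print(toy_test["Exited"].value_counts().to_dict())
```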

Initialize the split datasets

Next, let's initialize the training and testing datasets so they are available for use:

vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

In summary

In this second notebook, you learned how to:

Import a sample dataset
Identify which tests you might want to run with ValidMind
Initialize ValidMind datasets
Run individual tests
Utilize the output from tests you’ve run
Log test results as evidence to the ValidMind Platform
Insert test results into your validation report

Next steps

Develop potential challenger models

Now that you're familiar with the basics of using the ValidMind Library, let's use it to develop a challenger model: 3 — Developing a potential challenger